By Daniel Perruchoud and George Rowlands
This notebook explores the field of Cleantech using a Kaggle dataset of nearly 10,000 news articles centered on the energy sector. We'll move from data exploration through text preprocessing to the construction of a Retrieval-Augmented Generation (RAG) pipeline. This approach lets us build a system in which a Large Language Model (LLM) intelligently answers user queries by drawing on the knowledge in our curated news articles.
Fine-tuning an LLM can be resource-intensive and inflexible; RAG offers a compelling alternative. It uses semantic search to pinpoint the sections of our news articles most relevant to a user's question. These retrieved sections are then provided to the LLM as context, enabling it to deliver informed, grounded responses.
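As a concrete toy illustration of the retrieval step, the sketch below ranks text chunks against a query using a bag-of-words vector and cosine similarity, then builds a context-augmented prompt. This is only a didactic stand-in: the pipeline we build later uses learned sentence embeddings and a vector store instead.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list, k: int = 1) -> list:
    # Rank candidate chunks by similarity to the query; return the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Solar panel prices fell sharply last year.",
    "A new offshore wind farm was approved in Scotland.",
]
context = retrieve("What happened to solar prices?", chunks)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: What happened to solar prices?"
```

The retrieved chunk is injected into the prompt as context, which is exactly what the LangChain pipeline later in the notebook does with real embeddings.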

To run this notebook we recommend downloading the provided GitHub repository and opening the notebook in Google Colab. To ensure a smooth experience, note the following:
At the start of the notebook, a data.zip archive is downloaded from Google Drive and unzipped. It contains checkpoint files for all of the expensive processing steps, such as chunking, generating embeddings, and evaluating the pipeline with an LLM as judge. This saves you both money and a lot of time.
If you can't or don't want to run this notebook you can also view the completed notebook by opening the cleantech_rag.html file in your browser.
Throughout this notebook, we'll examine the inner workings of RAG pipelines in detail.
Questions or Issues? We're Here to Help!
If you encounter any roadblocks or have questions, please don't hesitate to reach out to George Rowlands.
%%writefile requirements.txt
chromadb==0.5.0
datasets==2.19.1
gdown==5.2.0
kaggle==1.6.1
langchain==0.2.0
langchain-community==0.2.0
langchain-experimental==0.0.59
langchain-openai==0.1.7
langdetect==1.0.9
lorem-text==2.1
nbformat>=4.2.0
plotly==5.22.0
pretty-jupyter==1.0
ragas==0.1.8
seaborn==0.13.2
sentence-transformers==3.0.0
spacy>=3.7
textstat==0.7.3
umap-learn==0.5.5
Overwriting requirements.txt
%pip install torch==2.3.0 --quiet --index-url https://download.pytorch.org/whl/cu121
Note: you may need to restart the kernel to use updated packages.
%pip install -r ./requirements.txt --quiet
Note: you may need to restart the kernel to use updated packages.
import json
import os
import warnings
import zipfile
from collections import Counter
from pathlib import Path
from typing import Dict, List
import chromadb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import torch
from chromadb import Collection, Documents, EmbeddingFunction, Embeddings
from datasets import Dataset
from dotenv import load_dotenv
from langdetect import detect
from lorem_text import lorem
from ragas import RunConfig, evaluate
from ragas.metrics import (faithfulness, answer_relevancy, context_relevancy, answer_correctness)
from spacy.lang.en import English
from textstat import flesch_reading_ease
from tqdm import tqdm
import umap
from langchain.chains.base import Chain
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, VectorStore
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import LLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
load_dotenv()
warnings.filterwarnings("ignore")
!gdown 1MoT_s_Zk4dzRRy7E7Va5ZuTROIOI1FfZ
with zipfile.ZipFile("data.zip", "r") as zip_file:
zip_file.extractall()
An OpenAI key is required for several of the tasks in this notebook.
To set it rename the .env-example file to .env and add the key in the provided slot.
openai_key = "sk-XXXXXXXXXXXXXXXX"
To make sure our OpenAI key is working, we'll test it by generating a response from the chat model, which we will also be using later in our RAG pipeline. Try some different prompts or questions to see how the model responds.
llm = ChatOpenAI(model="gpt-3.5-turbo")
question_prompt = ChatPromptTemplate.from_template(
"Answer the following question: {question}")
question_chain = question_prompt | llm | StrOutputParser()
question_chain.invoke({"question": "What is the meaning of life?"})
'The meaning of life is a complex and subjective concept that varies from person to person. Some may believe that the meaning of life is to seek happiness and fulfillment, others may see it as a journey of self-discovery and personal growth, while others may find meaning in their relationships with others or in their contributions to society. Ultimately, the meaning of life is a deeply personal and individual question that each person must explore and define for themselves.'
We will be exploring the following Cleantech Media Dataset. If you have opened this notebook as recommended, via the provided GitHub repository in Google Colab, then you don't need to download the dataset; it should already be under data/bronze. If not, you can either manually download it and upload it into a data/bronze folder, or follow the steps below.
We will be using the Kaggle API to download the data.
To use the Kaggle API you will need a Kaggle account. If you don't already have one, sign up for a Kaggle account at https://www.kaggle.com. When you are logged in, go to the 'Settings' tab of your user profile https://www.kaggle.com/settings and select 'Create New Token'. This will trigger the download of kaggle.json, a file containing your API credentials.
You can then add your Kaggle username and key from the kaggle.json file to the .env file just like with the OpenAI Key.
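As an alternative to placing kaggle.json on disk, the Kaggle client also reads credentials from environment variables. A minimal sketch, assuming the variable names `KAGGLE_USERNAME` and `KAGGLE_KEY` from the Kaggle API documentation (the placeholder values below are yours to replace):

```python
import os

# The Kaggle CLI/API picks these up automatically when kaggle.json is absent.
# setdefault keeps any values already loaded from .env by load_dotenv().
os.environ.setdefault("KAGGLE_USERNAME", "your_username")
os.environ.setdefault("KAGGLE_KEY", "your_key")
```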
data_folder = Path("./data")
if not data_folder.exists():
data_folder.mkdir()
bronze_folder = data_folder / "bronze"
if not bronze_folder.exists():
bronze_folder.mkdir()
%%script echo skipping
kaggle_user = "XXXXXXXXXXXXXXXX"
kaggle_key = "XXXXXXXXXXXXXXXX"
skipping
%%script echo skipping
os.system(f"kaggle datasets download -d jannalipenkova/cleantech-media-dataset -p {bronze_folder}")
skipping
%%script echo skipping
with zipfile.ZipFile(bronze_folder / "cleantech-media-dataset.zip", "r") as zip_file:
zip_file.extractall(bronze_folder)
skipping
articles_df = pd.read_csv(
bronze_folder / "cleantech_media_dataset_v2_2024-02-23.csv",
encoding='utf-8', index_col=0)
articles_df.head()
| | title | date | author | content | domain | url |
|---|---|---|---|---|---|---|
| 1280 | Qatar to Slash Emissions as LNG Expansion Adva... | 2021-01-13 | NaN | ["Qatar Petroleum ( QP) is targeting aggressiv... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1281 | India Launches Its First 700 MW PHWR | 2021-01-15 | NaN | ["β’ Nuclear Power Corp. of India Ltd. ( NPCIL)... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1283 | New Chapter for US-China Energy Trade | 2021-01-20 | NaN | ["New US President Joe Biden took office this ... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1284 | Japan: Slow Restarts Cast Doubt on 2030 Energy... | 2021-01-22 | NaN | ["The slow pace of Japanese reactor restarts c... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1285 | NYC Pension Funds to Divest Fossil Fuel Shares | 2021-01-25 | NaN | ["Two of New York City's largest pension funds... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
human_eval_df = pd.read_csv(
bronze_folder / "cleantech_rag_evaluation_data_2024-02-23.csv",
encoding='utf-8', index_col=0)
human_eval_df.head()
| example_id | question_id | question | relevant_chunk | article_url |
|---|---|---|---|---|
| 1 | 1 | What is the innovation behind LeclanchΓ©'s new ... | LeclanchΓ© said it has developed an environment... | https://www.sgvoice.net/strategy/technology/23... |
| 2 | 2 | What is the EUβs Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 3 | 2 | What is the EUβs Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | https://www.pv-magazine.com/2023/02/02/europea... |
| 4 | 3 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 5 | 4 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | https://cleantechnica.com/2023/05/08/general-m... |
As the saying goes, "garbage in, garbage out." In the realm of machine learning, the quality of our outputs hinges on the quality of our inputs. This section delves into the essential processes of Exploratory Data Analysis (EDA) and data preprocessing. Through EDA, we'll illuminate the characteristics, patterns, and potential quirks residing within our cleantech news article dataset. Preprocessing will ensure our data is cleansed, structured, and prepared to be effectively utilized by the RAG pipeline, laying the foundation for high-quality results.
Let us start by gaining an overview of the datasets features (columns).
articles_df.describe()
| | title | date | author | content | domain | url |
|---|---|---|---|---|---|---|
| count | 9593 | 9593 | 31 | 9593 | 9593 | 9593 |
| unique | 9569 | 967 | 7 | 9588 | 19 | 9593 |
| top | Cleantech Thought Leaders Series | 2023-05-04 | Michael Holder | ['Geopolitics as much as price or quality will... | cleantechnica | https://www.energyintel.com/0000017b-a7dc-de4c... |
| freq | 5 | 427 | 8 | 2 | 1861 | 1 |
articles_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9593 entries, 1280 to 81816
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   title    9593 non-null   object
 1   date     9593 non-null   object
 2   author   31 non-null     object
 3   content  9593 non-null   object
 4   domain   9593 non-null   object
 5   url      9593 non-null   object
dtypes: object(6)
memory usage: 524.6+ KB
Our initial exploration reveals that the "author" column only contains data for 31 out of 9593 articles. Since this offers minimal information gain, we can remove this feature.
We've also observed that some titles and content entries appear to be non-unique. This might necessitate identifying and removing duplicate entries.
On a positive note, the article URLs are all unique, potentially serving as suitable unique identifiers for the data.
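Before relying on a column as an identifier, it is worth asserting its uniqueness explicitly. A small sketch with a toy stand-in for articles_df:

```python
import pandas as pd

# Toy stand-in for articles_df: verify the candidate ID column is unique
# before promoting it to the index.
toy_df = pd.DataFrame({
    "url": ["https://example.org/a", "https://example.org/b"],
    "title": ["A", "B"],
})
assert toy_df["url"].is_unique
toy_df = toy_df.set_index("url")
```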
articles_df = articles_df.drop(columns=["author"])
The dataset helpfully provides the domain names extracted from the article URLs. These domains essentially represent the publishers of the news articles. Let's analyze the distribution of publishers and see how many articles each publisher has contributed.
domain_counts = articles_df["domain"].value_counts()
domain_counts
domain
cleantechnica            1861
azocleantech             1627
pv-magazine              1206
energyvoice              1017
solarindustrymag          673
naturalgasintel           658
thinkgeoenergy            645
rechargenews              559
solarpowerworldonline     505
energyintel               234
pv-tech                   232
businessgreen             158
greenprophet               80
ecofriend                  38
solarpowerportal.co        34
eurosolar                  28
decarbxpo                  19
solarquarter               17
indorenergy                 2
Name: count, dtype: int64
barplot = sns.barplot(
x=domain_counts.values,
y=domain_counts.index,
hue=domain_counts.index
)
barplot.set_title('Article Counts by Domain')
barplot.set_xlabel('Article Count')
barplot.set_ylabel('Domain')
plt.show()
Our exploration of article domains reveals a skewed distribution. Publishers like cleantechnica have a significantly higher representation (1861 articles), while others like indorenergy have minimal contributions (2 articles). If we proceed with sampling this data, this imbalance should be taken into account. Stratified sampling could be a viable approach to ensure a representative sample across different publishers.
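A sketch of how stratified sampling by domain could look with pandas, using `DataFrameGroupBy.sample` so every publisher contributes the same fraction of its articles (the helper name and the toy data are illustrative, not part of the pipeline):

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, strata_col: str, frac: float, seed: int = 42) -> pd.DataFrame:
    # Draw the same fraction from every stratum so the sampled
    # domain distribution mirrors the full dataset's.
    return df.groupby(strata_col).sample(frac=frac, random_state=seed)

# Toy example: one dominant domain, one tiny one.
toy = pd.DataFrame({"domain": ["a"] * 100 + ["b"] * 10, "x": range(110)})
sampled = stratified_sample(toy, "domain", frac=0.5)
```

Because the fraction is applied per group, the sample preserves the 10:1 ratio between the two domains instead of letting the dominant one crowd out the other.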
Each article within the dataset is accompanied by a publication date. Let's delve into the temporal range of these articles and investigate any noteworthy patterns in publication trends.
# plot the amount of articles over time
articles_df["date"] = pd.to_datetime(articles_df["date"])
time_df = articles_df.groupby("date").size().reset_index()
time_df.columns = ["date","count"]
time_df.describe()
| | date | count |
|---|---|---|
| count | 967 | 967.000000 |
| mean | 2022-06-01 19:11:06.390899456 | 9.920372 |
| min | 2021-01-01 00:00:00 | 1.000000 |
| 25% | 2021-09-11 12:00:00 | 4.000000 |
| 50% | 2022-06-06 00:00:00 | 9.000000 |
| 75% | 2023-02-14 12:00:00 | 13.000000 |
| max | 2023-12-05 00:00:00 | 427.000000 |
| std | NaN | 15.206340 |
sns.lineplot(data=time_df, x="date", y="count")
plt.title("Article Count Over Time")
plt.xlabel("Date")
plt.xticks(rotation=90)
plt.ylabel("Article Count")
# add a line for the average
avg_count = time_df["count"].mean()
plt.axhline(avg_count, color='r', linestyle='--', label=f"Average article count per day: {avg_count:.2f}")
plt.legend()
plt.show()
While the daily article count appears consistent overall, a significant outlier disrupts the pattern on 2023-12-05. The cause of this outlier is undetermined, but it could be the date the data was scraped, used as the default value for missing dates. Since the publication date is not crucial for the RAG pipeline, we can remove it.
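If we did want to detect such spikes programmatically rather than by eye, a simple z-score threshold over the daily counts would flag them. A sketch (the helper name and the 3-sigma cutoff are illustrative choices, not part of the pipeline):

```python
import pandas as pd

def flag_outlier_days(counts: pd.Series, z: float = 3.0) -> pd.Series:
    # counts: number of articles per day. Returns the days whose count
    # lies more than `z` standard deviations above the mean.
    threshold = counts.mean() + z * counts.std()
    return counts[counts > threshold]

# Toy example: 30 ordinary days and one extreme spike.
toy_counts = pd.Series([10] * 30 + [400])
outliers = flag_outlier_days(toy_counts)
```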
articles_df = articles_df.drop(columns=["date"])
As noted in our initial exploration, some articles share identical titles. Here, we'll focus on identifying and handling these duplicate titles to ensure a clean and consistent dataset for our RAG pipeline.
sns.histplot(articles_df["title"].str.len())
plt.title("Title Length Distribution")
plt.xlabel("Title Length")
plt.ylabel("Count")
avg_count = articles_df["title"].str.len().mean()
plt.axvline(avg_count, color='r', linestyle='--', label=f"Average title length: {avg_count:.2f}")
plt.legend()
plt.show()
articles_df["title"].duplicated().sum()
24
duplicate_titles = articles_df[articles_df["title"].duplicated(keep=False)].sort_values("title")
duplicate_titles.head(10)
| | title | content | domain | url |
|---|---|---|---|---|
| 6654 | Aberdeenβ s NZTC plans national centre for geo... | ['Aberdeenβ s NZTC is planning a national cent... | energyvoice | https://www.energyvoice.com/renewables-energy-... |
| 6660 | Aberdeenβ s NZTC plans national centre for geo... | ['Aberdeenβ s NZTC is planning a national cent... | energyvoice | https://sgvoice.energyvoice.com/strategy/techn... |
| 38593 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cross |
| 38599 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cro... |
| 38596 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cro... |
| 38598 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cro... |
| 38597 | About David J. Cross | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/authors/david-cro... |
| 6704 | BEIS mulls ringfenced CfD support for geotherm... | ['Ministers are considering whether geothermal... | energyvoice | https://sgvoice.energyvoice.com/policy/21121/b... |
| 6702 | BEIS mulls ringfenced CfD support for geotherm... | ['Ministers are considering whether geothermal... | energyvoice | https://www.energyvoice.com/renewables-energy-... |
| 37040 | Cleantech Insights from Industry Series | ["By clicking `` Allow All '' you agree to the... | azocleantech | https://www.azocleantech.com/Insights.aspx?page=2 |
duplicate_titles["content"].duplicated().sum()
0
Our exploration identified 24 titles that appear multiple times in the dataset. Examples include "About David J. Cross." Interestingly, while the titles are identical, the content itself appears to be unique.
We also made some additional observations that warrant further investigation.
def wrap_text(text: str, char_per_line: int = 100) -> str:
    """For better readability, wrap the text at the last space before char_per_line."""
    lines = []
    while len(text) >= char_per_line:
        # Break at the last space within the limit (falls back to the
        # full slice if it contains no space).
        head = text[:char_per_line].rsplit(' ', 1)[0]
        lines.append(head)
        text = text[len(head) + 1:]
    lines.append(text)
    return '\n'.join(lines)
print(duplicate_titles.iloc[0]["title"])
print(wrap_text(duplicate_titles.iloc[0]["content"]))
Aberdeenβ s NZTC plans national centre for geothermal energy ['Aberdeenβ s NZTC is planning a national centre to accelerate geothermal energy in the UK and become the β go-to β hub globally for the renewable technology.', 'Calum Watson, senior project engineer at the Net Zero Technology Centre, has outlined ambitions for the oil and gas industry to help ramp up the clean energy, both onshore and in the North Sea.', 'The NZTCβ s new β National Geothermal Innovation Centre β would develop technology and help create β bespoke regulation β for geothermal, with the aim of it providing 5% of UK energy needs by 2030.', 'By 2050, Mr Watson said geothermal could account for 20% of Britainβ s energy mix, slashing carbon emissions in the process.', 'Geothermal is a burgeoning technology β which has been picked up in some countries like Iceland and the Philippines β which harnesses heat in the subsurface of the earth to generate electricity.', 'Some barriers to its uptake include expensive up-front costs like exploration and drilling.', 'However a report published this week by trade body Offshore Energies UK said there are 2,100 offshore oil and gas wells to be decommissioned in the North Sea next decade β which Mr Watson described as a β massive opportunity β for geothermal', 'Based at a β north-east location β, the new hub would be the β go to centre globally for geothermal technology challenges but, crucially, would be world-leading in supporting government, and creating legislation and best practice for geothermal β.', 'Speaking at the Offshore Decommissioning Conference in St Andrews on Tuesday, Mr Watson did not disclose whether the plan had backers or when it might be set up.', 'He said it would be achieved through a β partner-led roadmap β akin to the NZTC itself β which is funded with Β£180m of UK and Scottish Government funding β and ultimately be powered by geothermal energy.', 'The national base would comprise a β solution centre β to scale up technologies from 
pilot stage.', 'It would also have a knowledge hub to share learnings and an β accelerator programme β to fund start-ups.', 'The NZTC has already dipped its toe into the tech β supporting a β first of its kind β test project for the EnQuest Magnus platform in the North Sea.', 'Mr Watson set out his hopes for what the centre could achieve by 2030, and highlighted the opportunity for oil and gas workers to transfer to the sustainable technology.', 'β ( By 2030) we want the centre to have delivered geothermal energy, accounting for 5% of the UKβ s energy mix and on route for 20% by 2050.', 'β We would have multiple demonstrators successfully delivered to showcase and educate and, long term, the center will be run on geothermal energy.']
print(duplicate_titles.iloc[1]["title"])
print(wrap_text(duplicate_titles.iloc[1]["content"]))
Aberdeenβ s NZTC plans national centre for geothermal energy ['Aberdeenβ s NZTC is planning a national centre to accelerate geothermal energy in the UK and become the β go-to β hub globally for the renewable technology.', 'Calum Watson, senior project engineer at the Net Zero Technology Centre, has outlined ambitions for the oil and gas industry to help ramp up the clean energy, both onshore and in the North Sea.', 'The NZTCβ s new β National Geothermal Innovation Centre β would develop technology and help create β bespoke regulation β for geothermal, with the aim of it providing 5% of UK energy needs by 2030.', 'By 2050, Mr Watson said geothermal could account for 20% of Britainβ s energy mix, slashing carbon emissions in the process.', 'Geothermal is a burgeoning technology β which has been picked up in some countries like Iceland and the Philippines β which harnesses heat in the subsurface of the earth to generate electricity.', 'Some barriers to its uptake include expensive up-front costs like exploration and drilling.', 'However a report published this week by trade body Offshore Energies UK said there are 2,100 offshore oil and gas wells to be decommissioned in the North Sea next decade β which Mr Watson described as a β massive opportunity β for geothermal', 'Based at a β north-east location β, the new hub would be the β go to centre globally for geothermal technology challenges but, crucially, would be world-leading in supporting government, and creating legislation and best practice for geothermal β.', 'Speaking at the Offshore Decommissioning Conference in St Andrews on Tuesday, Mr Watson did not disclose whether the plan had backers or when it might be set up.', 'He said it would be achieved through a β partner-led roadmap β akin to the NZTC itself β which is funded with Β£180m of UK and Scottish Government funding β and ultimately be powered by geothermal energy.', 'The national base would comprise a β solution centre β to scale up technologies from 
pilot stage.', 'It would also have a knowledge hub to share learnings and an β accelerator programme β to fund start-ups.', 'The NZTC has already dipped its toe into the tech β supporting a β first of its kind β test project for the EnQuest Magnus platform in the North Sea.', 'Mr Watson set out his hopes for what the centre could achieve by 2030, and highlighted the opportunity for oil and gas workers to transfer to the sustainable technology.', 'β ( By 2030) we want the centre to have delivered geothermal energy, accounting for 5% of the UKβ s energy mix and on route for 20% by 2050.']
Our analysis suggests redundancy within certain article pairs: the two versions appear identical except that one carries an additional sentence at the end.
Let's take a closer look at these "energyvoice" articles and how their contents start, and see if we can eliminate these redundancies.
energyvoice_articles = articles_df[articles_df["domain"].str.contains("energyvoice")]
energyvoice_articles.content.map(lambda x: x[:50]).value_counts()
content
['', '', 'The Megawatt Hour is the latest podcast 6
['A group of trade associations from across the en 3
['Two years after the Amazon Pledge Fund invested 3
['The latest analysis shows that capital flows tow 2
['Macquarie Group is betting the North Sea β engin 2
..
['Now more than ever β in terms of cost and the im 1
['Scientists have hailed a helium discovery which 1
['Marine equipment fabrication and rental speciali 1
['The Russian powers behind oil explorers Exillon 1
['Aberdeen-headquartered Repsol Sinopec Resources 1
Name: count, Length: 980, dtype: int64
def remove_prefix_articles(df: pd.DataFrame, prefix_len: int = 100) -> pd.DataFrame:
    """
    Remove articles that are prefixes of longer articles (O(n^2) pairwise scan).

    Articles are compared on their first `prefix_len` characters. If a
    shorter article shares that prefix with a longer one and both have the
    same title, the shorter one is treated as a redundant prefix and dropped.
    Articles with matching prefixes but different titles are kept.
    """
    df = df.copy()
    df["char_len"] = df["content"].map(len)
    df = df.sort_values(by="char_len", ascending=True).reset_index(drop=True)
    # Collect the articles that are not prefixes of any longer article
    non_prefix_articles = []
    for i, row in df.iterrows():
        is_prefix = False
        content_i = row["content"][:prefix_len]
        title_i = row["title"]
        # Only longer articles (later in the sorted order) can supersede row i
        for j in range(i + 1, len(df)):
            if content_i == df.at[j, "content"][:prefix_len]:
                # If the prefix matches but the titles differ, we keep it
                if title_i != df.at[j, "title"]:
                    continue
                is_prefix = True
                break
        if not is_prefix:
            non_prefix_articles.append(row)
    print(f"Removed {len(df) - len(non_prefix_articles)} prefix articles")
    return pd.DataFrame(non_prefix_articles).drop(columns=["char_len"])
energyvoice_articles = remove_prefix_articles(energyvoice_articles)
energyvoice_articles.content.map(lambda x: x[:100]).value_counts()
Removed 11 prefix articles
content
['', '', 'The Megawatt Hour is the latest podcast boxset brought to you by Energy Voice Out Loud in 6
['Two years after the Amazon Pledge Fund invested in Hippo Harvest, the company is selling its first 3
['A group of trade associations from across the energy sector have written to the Chancellor urging 3
['Global Port Services has confirmed the award of multiple contracts in support of the Seagreen wind 2
['DNV report shows Jotunβ s Baltoflake solution offers beyond 30 yearsβ protection for offshore asse 2
..
['The deal volume for renewable energy assets in Asia more than tripled to $ 13.6 billion in 2021, a 1
['Several young energy professionals have undertaken a voyage across Scotland to spotlight the count 1
['A UK-backed research group unveiled a design for a liquid hydrogen-powered airliner theoretically 1
['UK-listed Pharos Energy is excited about its upcoming Vietnam activities with a 3D seismic shoot l 1
['With the greatest and most urgent energy transition in human history accelerating, the quest for n 1
Name: count, Length: 981, dtype: int64
There still seems to be some redundancy, but we did manage to remove 11 duplicates.
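For larger datasets, the quadratic pairwise scan above could be replaced by bucketing articles on their (content prefix, title) key and keeping only the longest article per bucket. A hedged sketch under the same assumptions as `remove_prefix_articles` (the helper name is hypothetical):

```python
import pandas as pd

def remove_prefix_articles_fast(df: pd.DataFrame, prefix_len: int = 100) -> pd.DataFrame:
    # Bucket articles by (content prefix, title); within each bucket keep
    # only the longest article. Avoids the O(n^2) pairwise comparison.
    key = df["content"].str[:prefix_len] + "\x00" + df["title"]
    return (df.assign(_key=key, _len=df["content"].str.len())
              .sort_values("_len")
              .drop_duplicates("_key", keep="last")
              .drop(columns=["_key", "_len"]))

# Toy example: the first article is a prefix of the second (same title),
# while the third shares a prefix but has a different title.
toy = pd.DataFrame({
    "title": ["T", "T", "U"],
    "content": ["hello world", "hello world and more", "hello there"],
})
deduped = remove_prefix_articles_fast(toy, prefix_len=5)
```

Note this only catches articles whose first `prefix_len` characters match exactly, just like the function above.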
Having explored various aspects of our dataset, we now turn our attention to the heart of the matter: the article content itself. This section will delve into the analysis and preprocessing techniques we'll employ to ensure the content is high-quality and effectively utilized by our RAG pipeline.
np.random.seed(7)
random_sample_id = np.random.choice(articles_df.index)
print(wrap_text(articles_df.loc[random_sample_id, "content"]))
['Enphase Energy Inc., a supplier of microinverter-based solar and battery systems, says its partner Lumio will be significantly expanding its offering of Enphase IQ8 microinverters and IQ batteries to customers across the United States.', 'The strategic relationship with Lumio will amplify the impact and distribution of Enphase systems, providing homeowners more access to reliable, sustainable and grid-independent power sources, the company says.', 'β We are excited about Enphaseβ s full suite of products β including microinverters, batteries and EV chargers β that can provide our customers best-in-class home energy management solutions, β says Greg Butterfield, CEO at Lumio. β Additionally, the Enphase digital platform, from lead generation to permitting to ongoing operations and maintenance services, offers a unique ability for Lumio to increase efficiencies and reduce costs. β', 'For homeowners who want battery backup, there are no sizing restrictions on pairing Enphase IQ batteries with IQ8 microinverters, and the Sunlight Jump Start feature can restart a home energy system β switching to sunlight-only after prolonged grid outages that may result in a fully depleted battery. This eliminates the need for a manual restart of the system and gives homeowners greater assurance of energy resilience.', 'β This strategic relationship with Enphase makes it easier for Lumioβ s customers to take control of their power production, power consumption, and increase the security and reliability of their familyβ s power supply, β adds David Schonberg, senior vice president of energy partnerships at Lumio.', 'Solar Industry offers industry participants probing, comprehensive assessments of the technology, tools and trends that are driving this dynamic energy sector. From raw materials straight through to end-user applications, we capture and analyze the critical details that help professionals stay current and navigate the solar market.', 'Β© Copyright Zackin Publications Inc. 
All Rights Reserved.']
Our initial examination reveals that article content is currently stored as a list of strings. To gain deeper understanding and facilitate preprocessing, we'll transform these lists into a more cohesive textual format.
import ast

# ast.literal_eval safely parses the stringified Python lists in `content`
# (unlike eval, it never executes arbitrary code from the CSV)
articles_df['article'] = articles_df['content'].apply(lambda x: ' '.join(ast.literal_eval(x)))
print(wrap_text(articles_df.loc[random_sample_id, "article"]))
Enphase Energy Inc., a supplier of microinverter-based solar and battery systems, says its partner Lumio will be significantly expanding its offering of Enphase IQ8 microinverters and IQ batteries to customers across the United States. The strategic relationship with Lumio will amplify the impact and distribution of Enphase systems, providing homeowners more access to reliable, sustainable and grid-independent power sources, the company says. β We are excited about Enphaseβ s full suite of products β including microinverters, batteries and EV chargers β that can provide our customers best-in-class home energy management solutions, β says Greg Butterfield, CEO at Lumio. β Additionally, the Enphase digital platform, from lead generation to permitting to ongoing operations and maintenance services, offers a unique ability for Lumio to increase efficiencies and reduce costs. β For homeowners who want battery backup, there are no sizing restrictions on pairing Enphase IQ batteries with IQ8 microinverters, and the Sunlight Jump Start feature can restart a home energy system β switching to sunlight-only after prolonged grid outages that may result in a fully depleted battery. This eliminates the need for a manual restart of the system and gives homeowners greater assurance of energy resilience. β This strategic relationship with Enphase makes it easier for Lumioβ s customers to take control of their power production, power consumption, and increase the security and reliability of their familyβ s power supply, β adds David Schonberg, senior vice president of energy partnerships at Lumio. Solar Industry offers industry participants probing, comprehensive assessments of the technology, tools and trends that are driving this dynamic energy sector. From raw materials straight through to end-user applications, we capture and analyze the critical details that help professionals stay current and navigate the solar market. Β© Copyright Zackin Publications Inc. All Rights Reserved.
articles_df["article"].duplicated().sum()
5
duplicate_articles = articles_df[articles_df["article"].duplicated(keep=False)].sort_values("article")
duplicate_articles
| | title | content | domain | url | article |
|---|---|---|---|---|---|
| 78215 | China's wind giants are chasing global growth:... | ['Geopolitics as much as price or quality will... | rechargenews | https://www.rechargenews.com/wind/chinas-wind-... | Geopolitics as much as price or quality will d... |
| 78216 | Why geopolitics will set the limits of China's... | ['Geopolitics as much as price or quality will... | rechargenews | https://www.rechargenews.com/wind/why-geopolit... | Geopolitics as much as price or quality will d... |
| 80067 | Sodium-ion battery production capacity to grow... | ['Global demand for sodium-ion batteries is ex... | pv-magazine | https://www.pv-magazine.com/2023/07/17/sodium-... | Global demand for sodium-ion batteries is expe... |
| 80073 | Sodium-ion battery fleet to grow to 10 GWh by ... | ['Global demand for sodium-ion batteries is ex... | pv-magazine | https://www.pv-magazine.com/2023/07/17/sodium-... | Global demand for sodium-ion batteries is expe... |
| 6685 | Indonesia seeks investors for giant geothermal... | ['Indonesia, home to the world's largest geot... | energyvoice | https://www.energyvoice.com/oilandgas/467719/i... | Indonesia, home to the world's largest geothe... |
| 6689 | Indonesia seeks investors for giant geothermal... | ['Indonesia, home to the world's largest geot... | energyvoice | https://sgvoice.energyvoice.com/investing/2002... | Indonesia, home to the world's largest geothe... |
| 78225 | Quest for endless green energy from Earth's co... | ['One of Japan's largest utility groups Chubu... | rechargenews | https://www.rechargenews.com/energy-transition... | One of Japan's largest utility groups Chubu E... |
| 78227 | Limitless green energy from Earth's core quest... | ['One of Japan's largest utility groups Chubu... | rechargenews | https://www.rechargenews.com/news/2-1-1487279 | One of Japan's largest utility groups Chubu E... |
| 78210 | Portugal energy transition plan targets massiv... | ['Portugal has more than doubled its 2030 goal... | rechargenews | https://www.rechargenews.com/energy-transition... | Portugal has more than doubled its 2030 goals ... |
| 78212 | Wind, hydrogen and solar fused in Portugal's p... | ['Portugal has more than doubled its 2030 goal... | rechargenews | https://www.rechargenews.com/energy-transition... | Portugal has more than doubled its 2030 goals ... |
Our analysis uncovers a further form of content duplication: articles with identical content are sometimes reposted on the same domain under different titles (setting aside the "sgvoice.energyvoice.com" vs. "energyvoice.com" scenario addressed earlier). We will deliberately keep these duplicates, since their contents match but their titles differ.
Importance of Titles
We keep these duplicate articles because titles can hold significant relevance for our RAG pipeline. Consider a user query that uses an abbreviation which appears only in an article's title, while the article's content always spells out the full term. To bridge this gap, we'll prepend titles to the article content during preprocessing. This ensures that the retrieval process considers not only the content itself, but also the potentially informative titles.
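The title-prepending step described above can be sketched as follows. This is a minimal illustration on hypothetical records using plain Python dicts; in the notebook the same idea is applied to the pandas DataFrame columns.

```python
# Hypothetical records: prepend each article's title to its content so the
# retriever can also match on title-only terms such as abbreviations.
records = [
    {"title": "EU unveils GDIP", "content": "The Green Deal Industrial Plan is a bid ..."},
    {"title": "Sodium-ion outlook", "content": "Global demand for sodium-ion batteries ..."},
]
for record in records:
    # The combined "article" field is what would later be chunked and embedded.
    record["article"] = f"{record['title']}. {record['content']}"
```

A query containing "GDIP" can now match the first record even though its content only ever uses the full phrase.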
Next Step
As previously noted, some articles exhibit standardized introductions, possibly artifacts of the data scraping process. We'll develop appropriate techniques to handle these introductions during preprocessing, ensuring they don't hinder the effectiveness of our RAG pipeline.
articles_df.article.map(lambda x: x[:50]).value_counts()
article
By clicking `` Allow All '' you agree to the stori 1627
Sign in to get the best natural gas news and data. 658
window.dojoRequire ( [ `` mojo/signup-forms/Loader 52
None of these red flags by themselves make a compa 19
Volkswagen ID.4 sales were up 254% in the 1st quar 14
...
You want to invest in renewable energy or a better 1
The best way to deal with carbon is not to release 1
When there is deflation, the prices of goods in th 1
Stickers are excellent products to leverage in bot 1
Arevon Energy Inc. has closed financing on the Vik 1
Name: count, Length: 6765, dtype: int64
artifacts = [
    "By clicking `` Allow All '' you agree to the sto",
    "Sign in to get the best natural gas news and dat",
    "window.dojoRequire ( [ `` mojo/signup-forms/Load"
]
for artifact in artifacts:
    print(wrap_text(articles_df[articles_df.article.str.startswith(artifact)].article.iloc[0][:500]))
    print()
By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site
navigation, analyse site usage and support us in providing free open access scientific content.
More info. Nel Hydrogen is committed to pushing the boundaries of science and continues to support
the research and development of new and innovative technologies. A group of leading researchers and
two employees of Proton Energy Systems, Inc., a subsidiary of Nel ASA ( Nel Hydrogen) have recently
published
Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily
emails. Your email address * Your password * Remember me Continue Reset password Featured Content
News & Data Services Client Support Bidweek Markets | Natural Gas Prices | NGI All News Access
Major fluctuations in the latest weather models resulted in big swings in natural gas bidweek
prices, with solid gains on the East Coast and out West. However, much of the country's midsection
posted hefty
window.dojoRequire ( [ `` mojo/signup-forms/Loader '' ], function ( L) { L.start ( { `` baseUrl '':
'' mc.us4.list-manage.com '', '' uuid '': '' 2a6df7ce0f3230ba1f5efe12c '', '' lid '': '' 1e23cc3ebd
'', '' uniqueMethods '': true }) }) American consumers are more concerned about the planet than
steady economic growth, new report. Your company wants to be a part of this. What steps do you
take? Each company should create detailed reports that evaluate the environmental impact of the
business, num
def remove_scraping_artifacts(df: pd.DataFrame, column: str) -> pd.DataFrame:
    # Exact boilerplate strings found at the start of scraped articles
    text_artifacts = [
        "By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site navigation, analyse site usage and support us in providing free open access scientific content. More info.",
        "Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily emails. Your email address * Your password * Remember me Continue Reset password Featured Content News & Data Services Client Support"
    ]
    # Mailchimp signup-form JavaScript embedded in some articles
    regex_artifacts = [
        r"window\.dojoRequire \( \[ .*\}\) \}\) "
    ]
    # Operate on the DataFrame passed in, not on the global articles_df
    for pattern in text_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=False)
    for pattern in regex_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=True)
    return df
articles_df = remove_scraping_artifacts(articles_df, "article")
articles_df.article.map(lambda x: x[:50]).value_counts()
article
Daily GPI Energy Transition | Infrastructure | NG 38
Daily GPI E & P | NGI All News Access The U.S. na 36
Daily GPI Energy Transition | NGI All News Access 28
None of these red flags by themselves make a compa 19
Daily GPI Markets | Natural Gas Prices | NGI All 17
..
Award winning cleantech firm Aceleron's repairab 1
Generating safe, green energy is one thing but pr 1
Countries around the world need to move further a 1
The sun is arguably the most important renewable 1
Arevon Energy Inc. has closed financing on the Vik 1
Name: count, Length: 8749, dtype: int64
Our efforts have successfully eliminated a substantial portion of the scraping artifacts within the articles. However, some traces still persist, likely remnants of past website navigation structures. While removing these remaining artifacts could offer further refinement, doing so reliably presents a significant challenge. We'll therefore acknowledge this for now and move on to further preprocessing, such as filtering out articles that are not in English.
articles_df["lang"] = articles_df["article"].map(detect)
articles_df["lang"].value_counts()
lang
en    9588
de       4
ru       1
Name: count, dtype: int64
articles_df[articles_df["lang"] != "en"]
| title | content | domain | url | article | lang | |
|---|---|---|---|---|---|---|
| 8283 | International Energy Storage Conference ( IRES... | ['EUROSOLAR veranstaltet vom 16. bis 18. März ... | eurosolar | https://www.eurosolar.de/2021/01/26/internatio... | EUROSOLAR veranstaltet vom 16. bis 18. März 20... | de |
| 8304 | Open Letter to Presidents Putin, Biden, Zelens... | ['EUROSOLAR, the European Association for Rene... | eurosolar | https://www.eurosolar.de/sektionen/russland/ | EUROSOLAR, the European Association for Renewa... | ru |
| 8307 | Internationale Konferenz für Energiespeicher m... | ['Die nun zu Ende gegangene "Internationale E... | eurosolar | https://www.eurosolar.de/2022/09/26/internatio... | Die nun zu Ende gegangene "Internationale Ern... | de |
| 8308 | Presentations, Poster and Photos of the IRES 2022 | ['Photos from the IRES ( Copyright EUROSOLAR e... | eurosolar | https://www.eurosolar.de/2022/10/20/presentati... | Photos from the IRES ( Copyright EUROSOLAR e.V... | de |
| 24652 | SMS group liefert Prozesstechnologie für das e... | ['© SMS group liefert Prozesstechnologie für d... | decarbxpo | https://www.decarbxpo.com/en/News_Media/Magazi... | © SMS group liefert Prozesstechnologie für das... | de |
print(wrap_text(articles_df[articles_df["lang"] != "en"].iloc[1]["article"][1000:]))
suffering and misery for over a century, while distracting from the one common enemy threatening to consume all: accelerated fossil fueled climate heating. The Ukraine's EUROSOLAR section and its networks have long advocated a new age with renewable energy in Eastern Europe. Together with all of our other sections and members across the European continent, from Russia to the Netherlands, and from Turkey to Denmark, EUROSOLAR offers this Climate Peace Platform. Prof. Peter Droege, President of EUROSOLAR: "The time has come for Climate Peace Diplomacy, to confront everyone's common enemy: advanced fossil climate destabilization. This is one of ten actions presented by EUROSOLAR as the main agenda of our time." Dr. Brigitte Schmidt, Vice President and Board Member of EUROSOLAR Germany: "The time for renewable peace has come, part of our Regenerative Earth Decade program. It stands for rethinking and peaceful action for our common future." Since its very foundation in 1988 EUROSOLAR has worked to end fossil fuel wars through the great switch to 100% renewable energy. In the words of Hermann Scheer ( 1944-2010), founder of EUROSOLAR: "Renewable energies build peace". The age of fossil-nuclear threats must end, the existential focus must begin: www.earthdecade.org. EUROSOLAR also calls for a shift in thinking towards climate peace diplomacy that recognizes and combats fossil dependencies as humanity's greatest common enemy. https://www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom cy/
[The article then repeats the same open letter in Ukrainian and in Russian; that mis-encoded text is omitted here.]
Independent of political parties, institutions, companies and interest groups, EUROSOLAR has been developing and stimulating political and economic action drafts and concepts for the introduction of renewable energies since 1988. This ranges from market introduction strategies to proposals for further research and development policy, from tax policy subsidies to arms conversion with solar energy, from the contribution of solar energy for the Global South to agricultural, transport and construction policy. Europäische Vereinigung für Erneuerbare Energien e. V.
articles_df = articles_df[articles_df["lang"] == "en"]
Our exploration revealed a small number of non-English articles: four in German and one flagged as Russian (an open letter that mixes English, Ukrainian and Russian sections). Since most LLMs and embedding models are primarily trained on English text, removing these articles ensures compatibility with our chosen models for this notebook. For simplicity, we'll only focus on supporting English queries and responses within this RAG pipeline.
Introducing multilingual capabilities into a RAG pipeline adds a layer of complexity: the embedding model must support cross-lingual retrieval, the LLM must be able to reason over retrieved context and respond in the user's language, and evaluation requires question sets in every supported language.
Let us further analyze the contents of the articles. Before we do, let's define what we mean by characters, tokens and words: characters are the individual symbols in a text; words are the whitespace-separated units; and tokens are the units produced by a tokenizer, which typically splits off punctuation, possessives and contractions, so a text usually contains more tokens than words.
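To see why token counts run higher than word counts, here is a rough illustration with a naive regex tokenizer (spaCy's tokenizer, which we use below, is far more sophisticated, but splits off punctuation and possessives in a similar spirit):

```python
import re

text = "Enphase's IQ8 microinverters, says Lumio."
# Whitespace-separated words keep punctuation and possessives attached.
words = text.split()
# A naive tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
# The same text yields more tokens than words, because "," and "." become
# their own tokens and "Enphase's" splits into three pieces.
```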
sns.histplot(articles_df["article"].map(len), kde=True)
plt.title("Amount of characters in articles")
plt.xlabel("Amount of characters")
plt.ylabel("Number of articles")
median_char_len = articles_df["article"].map(len).median()
mean_char_len = articles_df["article"].map(len).mean()
plt.axvline(median_char_len, color='r', linestyle='--', label=f"Median character amount: {median_char_len:.2f}")
plt.axvline(mean_char_len, color='g', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
sns.histplot(articles_df["article"].map(lambda x: len(x.split())), kde=True)
plt.title("Amount of words in articles")
plt.xlabel("Amount of words")
plt.ylabel("Number of articles")
median_word_len = articles_df["article"].map(lambda x: len(x.split())).median()
mean_word_len = articles_df["article"].map(lambda x: len(x.split())).mean()
plt.axvline(median_word_len, color='r', linestyle='--', label=f"Median word amount: {median_word_len:.2f}")
plt.axvline(mean_word_len, color='g', linestyle='--', label=f"Mean word amount: {mean_word_len:.2f}")
plt.legend()
plt.show()
nlp = English()
tokenizer = nlp.tokenizer
sns.histplot(articles_df["article"].map(lambda x: len(tokenizer(x))), kde=True)
plt.title("Amount of tokens in articles")
plt.xlabel("Amount of tokens")
plt.ylabel("Number of articles")
median_token_len = articles_df["article"].map(lambda x: len(tokenizer(x))).median()
mean_token_len = articles_df["article"].map(lambda x: len(tokenizer(x))).mean()
plt.axvline(median_token_len, color='r', linestyle='--', label=f"Median token amount: {median_token_len:.2f}")
plt.axvline(mean_token_len, color='g', linestyle='--', label=f"Mean token amount: {mean_token_len:.2f}")
plt.legend()
plt.show()
all_tokens = [token.text for article in articles_df["article"] for token in tokenizer(article)]
# remove non-alphabetic tokens such as punctuation
alpha_tokens = [token for token in all_tokens if token.isalpha()]
alpha_tokens = [token.lower() for token in alpha_tokens]
alpha_token_counts = Counter(alpha_tokens)
sns.barplot(
    x=[count for token, count in alpha_token_counts.most_common(20)],
    y=[token for token, count in alpha_token_counts.most_common(20)],
    hue=[token for token, count in alpha_token_counts.most_common(20)]
)
plt.title("Most common alphabetic tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
# remove stopwords such as 'the', 'a', 'and'
non_stop_tokens = [token for token in alpha_tokens if not nlp.vocab[token].is_stop]
non_stop_token_counts = Counter(non_stop_tokens)
sns.barplot(
    x=[count for token, count in non_stop_token_counts.most_common(20)],
    y=[token for token, count in non_stop_token_counts.most_common(20)],
    hue=[token for token, count in non_stop_token_counts.most_common(20)]
)
plt.title("Most common non-stopword tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
As one would expect in a dataset of cleantech news articles, most of the tokens that are not punctuation or stopwords revolve around energy, climate, and technology. This is a good sign that the dataset is relevant to the topic at hand. The "s" token comes up frequently, most likely split off from possessive forms during tokenization. With an average of around 700 words per article, we can expect a good amount of information in each article and an average reading time of around 3-4 minutes.
The Flesch Reading Ease Score is a tool used to evaluate how easy a text is to understand, based on the average sentence length and the number of syllables per word. Scores typically range from 0 (very difficult to read) to 100 (very easy to read), though values outside this range are possible. This metric can be useful for assessing the readability of our articles and ensuring they are accessible to a broad audience.
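For reference, the score is computed as 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word). A rough re-implementation looks like the following; the `flesch_reading_ease` function from textstat that we use below handles syllable counting and edge cases much more carefully, so treat this only as a sketch of the formula:

```python
import re

def naive_flesch_reading_ease(text: str) -> float:
    # Rough sketch of the Flesch Reading Ease formula.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))

    def syllables(word: str) -> int:
        # Naive syllable estimate: count contiguous vowel groups.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    n_syllables = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)
```

Short, monosyllabic sentences score high, while long words drive the score down, even below zero for dense jargon.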
articles_df["readability"] = articles_df["article"].apply(flesch_reading_ease)
sns.histplot(articles_df["readability"], kde=True)
plt.title("Flesch Reading Ease of articles")
plt.xlabel("Flesch Reading Ease")
plt.ylabel("Number of articles")
mean_readability = articles_df["readability"].mean()
plt.axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
plt.legend()
plt.show()
domains = articles_df["domain"].unique()
# Setup the subplots based on the number of domains
plots_per_row = 3
num_rows = (len(domains) + 2) // plots_per_row
plot_height = 6
fig, axes = plt.subplots(num_rows, plots_per_row, figsize=(plot_height * plots_per_row, plot_height * num_rows))
axes = axes.flatten() # Flatten the axes array for easier iteration
# Plot for each domain
for i, domain in enumerate(domains):
    domain_articles = articles_df[articles_df["domain"] == domain]
    sns.histplot(domain_articles["readability"], kde=True, ax=axes[i], bins=30)
    axes[i].set_title(f'Readability of {domain}')
    axes[i].set_xlabel('Flesch Reading Ease Score')
    axes[i].set_ylabel("Number of articles")
    mean_readability = domain_articles["readability"].mean()
    axes[i].axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
# remove the empty plots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
To gauge the readability of our articles, we calculated the Flesch Reading Ease score. The average score of around 45 corresponds to "difficult" on the standard scale, roughly college-level text, which is typical of news and trade journalism. The content should still be well within reach of a professional audience and, consequently, of our RAG pipeline as well.
Our analysis revealed a consistent average Flesch Reading Ease score across most of the identified domains, with only minor variations. This indicates a relatively consistent level of readability across the different publishers within the dataset.
Finally we will save the cleaned dataset to a new file in the data/silver folder.
silver_folder = data_folder / "silver"
if not silver_folder.exists():
    silver_folder.mkdir()
articles_df.to_csv(silver_folder / "articles.csv", index=False)
Next we will analyze the provided evaluation questions and ensure that they match the content of the articles.
human_eval_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, 1 to 23
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   question_id     23 non-null     int64
 1   question        23 non-null     object
 2   relevant_chunk  23 non-null     object
 3   article_url     23 non-null     object
dtypes: int64(1), object(3)
memory usage: 920.0+ bytes
human_eval_df.rename(columns={"relevant_chunk":"relevant_section","article_url": "url"}, inplace=True)
human_eval_df.drop(columns=["question_id"], inplace=True)
human_eval_df.head()
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | https://www.sgvoice.net/strategy/technology/23... |
| 2 | What is the EU's Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 3 | What is the EU's Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | https://www.pv-magazine.com/2023/02/02/europea... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 5 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | https://cleantechnica.com/2023/05/08/general-m... |
sns.histplot(human_eval_df["question"].map(len), kde=True)
plt.title("Question Character Length Distribution")
plt.xlabel("Character Length")
plt.ylabel("Count")
mean_char_len = human_eval_df["question"].map(len).mean()
plt.axvline(mean_char_len, color='r', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
missing_articles = human_eval_df.copy()
missing_articles = missing_articles[~human_eval_df["url"].isin(articles_df["url"])]
missing_articles
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | https://www.sgvoice.net/strategy/technology/23... |
| 2 | What is the EU's Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
Our exploration has identified instances where articles linked to specific questions appear to be missing from the dataset. To determine the root cause, let's investigate whether these articles are genuinely absent or if inconsistencies in URL formatting are creating the illusion of missing data. Normalizing the URLs across the dataset will help us differentiate between these two scenarios.
def normalize_url(url: str) -> str:
    url = url.replace("https://", "")
    url = url.replace("http://", "")
    url = url.replace("www.", "")
    url = url.rstrip("/")
    return url
articles_df["url"] = articles_df["url"].map(normalize_url)
human_eval_df["url"] = human_eval_df["url"].map(normalize_url)
missing_articles = human_eval_df.copy()
missing_articles = missing_articles[~human_eval_df["url"].isin(articles_df["url"])]
missing_articles
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | sgvoice.net/strategy/technology/23971/leclanch... |
| 2 | What is the EU's Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | sgvoice.net/policy/25396/eu-seeks-competitive-... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | sgvoice.net/policy/25396/eu-seeks-competitive-... |
We also know from our earlier duplicate analysis that "sgvoice.net" articles appear in the dataset under the "sgvoice.energyvoice.com" hostname, so we will also remap these URLs.
missing_articles["url"] = missing_articles["url"].map(lambda x: x.replace("sgvoice.net", "sgvoice.energyvoice.com"))
missing_articles[~missing_articles["url"].isin(articles_df["url"])]
| question | relevant_section | url | |
|---|---|---|---|
| example_id |
human_eval_df.loc[missing_articles.index, "url"] = missing_articles["url"]
human_eval_df[human_eval_df["url"].isin(articles_df["url"])]
| question | relevant_section | url | |
|---|---|---|---|
| example_id | |||
| 1 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | sgvoice.energyvoice.com/strategy/technology/23... |
| 2 | What is the EU's Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... |
| 3 | What is the EU's Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | pv-magazine.com/2023/02/02/european-commission... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... |
| 5 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | cleantechnica.com/2023/05/08/general-motors-se... |
| 6 | Did Colgate-Palmolive enter into PPA agreement... | Scout Clean Energy, a Colorado-based renewable... | solarindustrymag.com/scout-and-colgate-palmoli... |
| 7 | What is the status of ZeroAvia's hydrogen fuel... | In December, the US startup ZeroAvia announced... | cleantechnica.com/2023/01/02/the-wait-for-hydr... |
| 8 | What is the "Danger Season"? | As spring turns to summer and the days warm up... | cleantechnica.com/2023/05/15/what-does-a-norma... |
| 9 | Is Mississipi an anti-ESG state? | Mississippi is among two dozen or so states in... | cleantechnica.com/2023/05/15/mississippi-takes... |
| 10 | Can you hang solar panels on garden fences? | Scaling down from the farm to the garden level... | cleantechnica.com/2023/05/18/solar-panels-for-... |
| 11 | Who develops quality control systems for ocean... | Scientists from the Chinese Academy of Science... | azocleantech.com/news.aspx?newsID=32873 |
| 12 | Why are milder winters detrimental for grapes ... | Since grapes and apples are perennial species,... | azocleantech.com/news.aspx?newsID=33040 |
| 13 | What are the basic recycling steps for solar p... | There are some simple recycling steps that can... | azocleantech.com/news.aspx?newsID=33143 |
| 14 | Why does melting ice contribute to global warm... | Whereas white ice reflects the sun's rays, a d... | azocleantech.com/news.aspx?newsID=33149 |
| 15 | Does the Swedish government plan bans on new p... | The Swedish government has proposed a ban on n... | azocleantech.com/news.aspx?newsID=33174 |
| 16 | Where do the turbines used in Icelandic geothe... | Minister Nishimura mentioned that most geother... | thinkgeoenergy.com/japan-and-iceland-agree-on-... |
| 17 | Who is the target user for Leapfrog Energy? | O'Brien added, "Subsurface specialists need fl... | thinkgeoenergy.com/seequent-expands-subsurface... |
| 18 | What is Agrivoltaics? | Agrivoltaics, the integration of food producti... | pv-magazine.com/2023/03/31/new-software-modeli... |
| 19 | What is Agrivoltaics? | Agrivoltaics refers to the conduct of agricult... | cleantechnica.com/2022/12/18/agrivoltaics-goes... |
| 20 | Why is cannabis cultivation moving indoors? | Cannabis cultivation can take place outdoors, ... | pv-magazine.com/2023/04/08/high-time-for-solar... |
| 21 | What are the obstacles for cannabis producers ... | "There are a lot of prevailing headwinds for c... | pv-magazine.com/2023/04/08/high-time-for-solar... |
| 22 | In 2021, what were the top 3 states in the US ... | In 2021, Florida surpassed North Carolina to b... | cleantechnica.com/2023/04/10/solar-power-in-fl... |
| 23 | Which has the higher absorption coefficient fo... | We chose amorphous germanium instead of amorph... | pv-magazine.com/2021/01/15/germanium-based-sol... |
In the end we are able to find all the articles that are linked to the evaluation questions and have therefore successfully completed our exploratory data analysis and preprocessing.
For faster processing and to reduce the cost of running the notebook, we will subsample the dataset to 1,000 articles. This lets us run the notebook in a reasonable amount of time while still producing meaningful results. Because the distribution of articles across publishers is skewed, we will use stratified sampling to ensure a representative sample. We also need to keep in mind that the evaluation questions are linked to specific articles, so we must make sure those articles are included in the subsample.
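The stratification idea can be sketched with pandas' built-in `groupby(...).sample` (a minimal toy example with a made-up `toy` frame, not the dataset used in this notebook): drawing the same fraction from every publisher preserves the skewed domain distribution in the sample.

```python
import pandas as pd

# toy frame with a skewed publisher distribution: 8 articles from "a", 2 from "b"
toy = pd.DataFrame({"domain": ["a"] * 8 + ["b"] * 2})

# sampling the same fraction from every group keeps the proportions intact
sample = toy.groupby("domain", group_keys=False).sample(frac=0.5, random_state=0)
print(sample["domain"].value_counts().to_dict())  # {'a': 4, 'b': 1}
```

The helper defined below does essentially this, but takes a target sample size instead of a fraction.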
eval_articles_df = articles_df[articles_df["url"].isin(human_eval_df["url"])]
eval_articles_df.head()
| title | content | domain | url | article | lang | readability | |
|---|---|---|---|---|---|---|---|
| 6780 | Leclanché' s new disruptive battery boosts ene... | ['Energy storage company Leclanché ( SW.LECN) ... | energyvoice | sgvoice.energyvoice.com/strategy/technology/23... | Energy storage company Leclanché ( SW.LECN) ha... | en | 43.22 |
| 6805 | EU seeks competitive boost with Green Deal Ind... | ['The EU has presented its 'Green Deal Indust... | energyvoice | sgvoice.energyvoice.com/policy/25396/eu-seeks-... | The EU has presented its 'Green Deal Industri... | en | 34.70 |
| 16367 | Agrivoltaics Goes Nuclear On California Prairie | ['A decommissioned nuclear power plant from th... | cleantechnica | cleantechnica.com/2022/12/18/agrivoltaics-goes... | A decommissioned nuclear power plant from the ... | en | 42.00 |
| 16402 | The Wait For Hydrogen Fuel Cell Electric Aircr... | ['The US firm ZeroAvia is one step closer to b... | cleantechnica | cleantechnica.com/2023/01/02/the-wait-for-hydr... | The US firm ZeroAvia is one step closer to bri... | en | 50.46 |
| 16725 | Solar Power In Florida | ['Many renewable energy endeavors in Florida a... | cleantechnica | cleantechnica.com/2023/04/10/solar-power-in-fl... | Many renewable energy endeavors in Florida are... | en | 44.75 |
print(eval_articles_df["url"].unique().shape)
print(human_eval_df["url"].unique().shape)
(21,)
(21,)
def do_stratification(
df: pd.DataFrame,
column: str,
sample_size: int,
seed: int = 42
) -> pd.DataFrame:
res_df = df.copy()
indx = df.groupby(column, group_keys=False)[column].apply(
lambda x: x.sample(n=int(sample_size / len(df) * len(x)), random_state=seed)
).index.to_list()
return res_df.loc[indx]
sample_df = do_stratification(articles_df, "domain", 1000, 69)
# drop any evaluation articles already present in the stratified sample so each url appears only once after re-adding them
sample_df = sample_df[~sample_df["url"].isin(eval_articles_df["url"])]
sample_df = pd.concat([sample_df, eval_articles_df])
sample_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1011 entries, 38325 to 81779
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        1011 non-null   object 
 1   content      1011 non-null   object 
 2   domain       1011 non-null   object 
 3   url          1011 non-null   object 
 4   article      1011 non-null   object 
 5   lang         1011 non-null   object 
 6   readability  1011 non-null   float64
dtypes: float64(1), object(6)
memory usage: 63.2+ KB
original_domain_counts = articles_df["domain"].value_counts().to_frame()
original_domain_counts = original_domain_counts / original_domain_counts.sum() * 100
domain_counts_df = original_domain_counts.copy()
domain_counts_df["type"] = "Original"
sample_domain_counts = sample_df["domain"].value_counts().to_frame()
sample_domain_counts = sample_domain_counts / sample_domain_counts.sum() * 100
sample_domain_counts["type"] = "Sample"
domain_counts_df = pd.concat([domain_counts_df, sample_domain_counts])
sns.barplot(
x=domain_counts_df.index,
y=domain_counts_df["count"],
hue=domain_counts_df["type"]
)
plt.title("Domain Distribution")
plt.xlabel("Domain")
plt.ylabel("Percentage")
plt.xticks(rotation=90)
plt.show()
Chunking is a crucial step in the RAG pipeline. It involves breaking down the articles into smaller, more manageable pieces.

There are mainly two reasons for this:
- Embedding models and LLMs can only process a limited amount of text at once, so full articles often exceed their input limits.
- Smaller, focused chunks produce more precise embeddings, which makes retrieval more accurate and keeps the context passed to the LLM relevant.
Let's start by getting a better feeling for the most common size of chunks based on the number of characters
def get_lorem_text(num_chars: int) -> str:
expected_avg_word_len = 3 # on the lower side to be safe
text = lorem.words(num_chars // expected_avg_word_len)
return text[:num_chars]
print(wrap_text(get_lorem_text(256)))
nihil rerum debitis fuga optio est modi sunt ratione tempore voluptatem reprehenderit cumque qui quasi doloribus soluta accusamus similique id obcaecati sit incidunt molestiae eveniet quod repudiandae laudantium libero voluptas autem harum natus quas volup
print(wrap_text(get_lorem_text(512)))
facere earum laborum amet distinctio nam ipsum quibusdam minus fuga molestiae quis perferendis sed suscipit animi sequi aliquam nisi cumque nulla deserunt aut in quos sapiente corrupti dolorum enim modi repellendus at assumenda voluptatibus pariatur quaerat temporibus magnam recusandae numquam qui error nesciunt quae praesentium quia accusantium dicta nihil soluta voluptas quod excepturi est deleniti dignissimos expedita exercitationem ut ipsa magni voluptates ratione iure ducimus voluptatum eum dolores ven
print(wrap_text(get_lorem_text(1024)))
non rem officia beatae dolores consequuntur labore numquam sapiente ipsa nesciunt veniam quas nihil fugiat hic nisi animi dolorem tempore eum tempora dolore accusantium amet incidunt consectetur exercitationem id saepe accusamus eaque eligendi atque eveniet voluptates deserunt earum aut delectus magni quae corporis dolorum laborum dicta totam vel dolor cumque fuga vero voluptatum quibusdam nam quod temporibus neque aliquam architecto quidem eius suscipit soluta ex ab at cum adipisci sunt nostrum placeat harum omnis sint nobis ducimus ut facere quia laudantium culpa obcaecati sequi quo perspiciatis iusto odio minus libero mollitia nulla repellat aperiam eos enim officiis in asperiores provident porro est et voluptatibus itaque aspernatur a repellendus praesentium voluptas assumenda quasi qui voluptate autem quisquam velit odit reprehenderit ratione sit alias natus tenetur repudiandae modi reiciendis nemo debitis laboriosam error recusandae minima dignissimos molestias ea quis deleniti fugit explicabo ipsum rer
print(wrap_text(get_lorem_text(2048)))
facilis ea cum impedit nemo quo rerum facere temporibus excepturi exercitationem aut incidunt provident quos dolore iure quae ipsam placeat similique autem voluptatibus voluptate at quasi eius sapiente id culpa alias dicta nostrum optio quam aperiam officiis fugit repellat illum nam voluptates velit minus atque doloribus nobis est tempora debitis sunt dolorum vero odio inventore harum recusandae distinctio aliquam amet consectetur ullam nisi officia cupiditate suscipit laboriosam nesciunt nulla minima quisquam hic natus tenetur sed non laudantium ab soluta vitae explicabo vel quidem molestiae praesentium repudiandae sint reprehenderit dolor beatae ad fuga expedita quod dolorem mollitia magnam labore omnis laborum odit voluptas earum aliquid et assumenda perspiciatis saepe ratione corporis iusto totam neque cumque ipsa tempore modi molestias perferendis animi voluptatem quia nihil maxime consequuntur doloremque accusamus iste magni error veritatis a dignissimos necessitatibus eveniet dolores maiores unde illo libero quis consequatur voluptatum veniam adipisci delectus pariatur obcaecati enim corrupti deserunt quas eligendi porro itaque sequi sit reiciendis rem ducimus ipsum commodi accusantium aspernatur ut qui fugiat ex esse in asperiores eos quaerat quibusdam blanditiis eaque deleniti possimus architecto numquam eum repellendus quaerat vel veniam temporibus quam dicta blanditiis beatae qui ea non ut nulla quia hic est vitae maiores magni eligendi nisi error neque fuga ad ducimus impedit aut amet dolor voluptas explicabo adipisci dolore delectus eaque necessitatibus pariatur tempore consectetur consequuntur culpa sequi similique perspiciatis fugiat quisquam nesciunt quis laborum dignissimos voluptates possimus repellat ratione voluptatibus quidem facere in provident deleniti voluptatum rerum quibusdam ex ipsa commodi distinctio accusamus dolorem tempora ullam ipsum perferendis deserunt numquam corporis facilis unde voluptatem totam aliquid maxime excepturi mollitia 
officiis asperiores iste laudantium atque odit d
In this notebook we will be using two different chunking strategies:
- Recursive character splitting (with target chunk sizes of 256 and 1024 characters), which splits on a hierarchy of separators such as paragraphs, sentences, and words.
- Semantic chunking, which places chunk boundaries where the embedding similarity between consecutive sentences drops.
To see how different texts get chunked with different strategies and chunk sizes check out the Chunking Visualizer.
def get_recursive_splitter(chunk_size: int, chunk_overlap: int) -> TextSplitter:
return RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
# the sentence separator is a regex lookbehind, so the separators
# must be treated as regular expressions rather than literal strings
is_separator_regex=True,
length_function=len,
)
# the recursive splitter first tries to split on newlines; our articles contain none, so it falls back to sentence boundaries
sample_df["article"].map(lambda x: x.count("\n")).sum()
0
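Under the hood, recursive splitting tries each separator in turn and only falls through to the next one when a piece is still too long. A simplified pure-Python sketch of that idea (ignoring chunk overlap and regex separators, which the real `RecursiveCharacterTextSplitter` also handles; `recursive_split` is an illustrative name, not a library function):

```python
def recursive_split(text: str, separators: list, chunk_size: int) -> list:
    """Simplified sketch of recursive character splitting (no overlap)."""
    # base case: the text already fits, or we have run out of separators
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks, buffer = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # piece is still too long: flush the buffer and recurse with finer separators
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
        elif not buffer:
            buffer = piece
        elif len(buffer) + len(sep) + len(piece) <= chunk_size:
            # greedily merge pieces until the chunk size is reached
            buffer = buffer + sep + piece
        else:
            chunks.append(buffer)
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks

print(recursive_split("aaa. bbb. ccc. ddd", ["\n\n", "\n", ". ", " ", ""], 8))
# ['aaa. bbb', 'ccc. ddd']
```

With no newlines present, the splitter falls straight through to the sentence separator, exactly as our articles will.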
# if we can make use of any device that is better than the CPU, we will use it
device = "cpu"
if torch.cuda.is_available():
device = "cuda"
elif torch.backends.mps.is_available():
device = "mps"
model_kwargs = {'device': device, "trust_remote_code": True}
model_kwargs
{'device': 'cuda', 'trust_remote_code': True}
embedding_models = {
"mini": HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs=model_kwargs),
"bge-m3": HuggingFaceEmbeddings(model_name="BAAI/bge-m3", model_kwargs=model_kwargs),
"gte": HuggingFaceEmbeddings(model_name="Alibaba-NLP/gte-base-en-v1.5", model_kwargs=model_kwargs),
}
recursive_256_splitter = get_recursive_splitter(256, 64)
recursive_1024_splitter = get_recursive_splitter(1024, 128)
semantic_splitter = SemanticChunker(
embedding_models["mini"], breakpoint_threshold_type="percentile"
)
splitters = {
"recursive_256": recursive_256_splitter,
"recursive_1024": recursive_1024_splitter,
"semantic": semantic_splitter
}
def chunk_documents(df: pd.DataFrame, text_splitter: TextSplitter):
chunks = []
id = 0
for _, row in tqdm(df.iterrows(), total=len(df)):
article_content = row['article']
title = row['title']
# we add the title to the content as it might be relevant to the question
full_text = title + ": " + article_content
char_chunks = text_splitter.split_text(full_text)
for chunk in char_chunks:
id += 1
# add metadata to the chunk for potential later use
metadata = {
'title': row['title'],
'url': row['url'],
'domain': row['domain'],
'id': id,
}
chunks.append(Document(
page_content=chunk,
metadata=metadata,
))
return chunks
chunks_folder = silver_folder / "chunks"
if not chunks_folder.exists():
chunks_folder.mkdir()
def get_or_create_chunks(df: pd.DataFrame, text_splitter: TextSplitter, splitter_name: str) -> List[Document]:
chunks_file = chunks_folder / f"{splitter_name}_chunks.json"
if chunks_file.exists():
with open(chunks_file, "r") as file:
chunks = [Document(**chunk) for chunk in json.load(file)]
print(f"Loaded {len(chunks)} chunks from {chunks_file}")
else:
chunks = chunk_documents(df, text_splitter)
with open(chunks_file, "w") as file:
json.dump([doc.dict() for doc in chunks], file, indent=4)
print(f"Saved {len(chunks)} chunks to {chunks_file}")
return chunks
chunks = {}
for splitter_name, splitter in splitters.items():
chunks[splitter_name] = get_or_create_chunks(sample_df, splitter, splitter_name)
Loaded 25399 chunks from data/silver/chunks/recursive_256_chunks.json
Loaded 5754 chunks from data/silver/chunks/recursive_1024_chunks.json
Loaded 3146 chunks from data/silver/chunks/semantic_chunks.json
Now that we have created and saved the chunks we can analyze them. We can already see above that the semantic chunks are generally larger than the recursive chunks.
Let's start by looking at the first chunk of the first article to get a feeling for what the chunks look like depending on the chunking strategy and then we will look at the distribution of the chunk sizes and the number of chunks per article.
for splitter_name, splitter_chunks in chunks.items():
print(f"{splitter_name} chunks:")
print(wrap_text(splitter_chunks[0].page_content, char_per_line=150))
print()
recursive_256 chunks:
Leclanché' s new disruptive battery boosts energy density: Energy storage company Leclanché ( SW.LECN) has designed a new battery cell that uses less cobalt and boosts energy density by 20%. The company says it is also produced in an environmentally

recursive_1024 chunks:
Leclanché' s new disruptive battery boosts energy density: Energy storage company Leclanché ( SW.LECN) has designed a new battery cell that uses less cobalt and boosts energy density by 20%. The company says it is also produced in an environmentally friendly way, making it more recyclable or easy to dispose of at end-of-life. Leclanché said it has developed an environmentally friendly way to produce lithium-ion ( Li-ion) batteries. It has replaced highly toxic organic solvents, commonly used in the production process, with a water-based process to make nickel-manganese-cobalt-aluminium cathodes ( NMCA). Organic solvents, such as N-methyl pyrrolidone ( NMP), are highly toxic and harmful to the environment. The use of NMP has been restricted by the European Commission, having been added to the list of Substances of Very High Concern, which can have serious irreversible effects on human health and the environment. Besides being technically simpler, eliminating the use of organic solvents also eliminates the risk

semantic chunks:
Leclanché' s new disruptive battery boosts energy density: Energy storage company Leclanché ( SW.LECN) has designed a new battery cell that uses less cobalt and boosts energy density by 20%. The company says it is also produced in an environmentally friendly way, making it more recyclable or easy to dispose of at end-of-life. Leclanché said it has developed an environmentally friendly way to produce lithium-ion ( Li-ion) batteries. It has replaced highly toxic organic solvents, commonly used in the production process, with a water-based process to make nickel-manganese-cobalt-aluminium cathodes ( NMCA). Organic solvents, such as N-methyl pyrrolidone ( NMP), are highly toxic and harmful to the environment. The use of NMP has been restricted by the European Commission, having been added to the list of Substances of Very High Concern, which can have serious irreversible effects on human health and the environment. Besides being technically simpler, eliminating the use of organic solvents also eliminates the risk of explosion, making the production process safer for employees. Leclanché claims to be a global pioneer in the field, having used aqueous binders in its for over a decade.
def plot_chunk_lengths(chunks: List[Document], title: str):
# compute the lengths once and reuse them for the histogram and the stats
chunk_lengths = [len(chunk.page_content) for chunk in chunks]
sns.histplot(chunk_lengths, kde=True)
plt.title(title)
plt.xlabel("Chunk length")
plt.ylabel("Number of chunks")
median_chunk_len = np.median(chunk_lengths)
mean_chunk_len = np.mean(chunk_lengths)
plt.axvline(median_chunk_len, color='r', linestyle='--', label=f"Median chunk length: {median_chunk_len:.2f}")
plt.axvline(mean_chunk_len, color='g', linestyle='--', label=f"Mean chunk length: {mean_chunk_len:.2f}")
plt.legend()
plt.show()
plot_chunk_lengths(chunks["recursive_256"], "Chunk lengths for recursive 256 splitter")
plot_chunk_lengths(chunks["recursive_1024"], "Chunk lengths for recursive 1024 splitter")
plot_chunk_lengths(chunks["semantic"], "Chunk lengths for semantic splitter")
chunks_per_article = {splitter_name: Counter(chunk.metadata["title"] for chunk in splitter_chunks) for splitter_name, splitter_chunks in chunks.items()}
counts = {splitter_name: list(chunk_counts.values()) for splitter_name, chunk_counts in chunks_per_article.items()}
sns.histplot(counts, kde=True)
plt.title("Number of chunks per article")
plt.xlabel("Number of chunks")
plt.ylabel("Number of articles")
plt.legend(chunks_per_article.keys())
plt.show()
From our analysis of the created chunks we can see that the recursive chunks are all close to the defined maximum size, while the semantic chunks vary considerably in size, since their boundaries follow the semantic structure of each article.
We can also see that, although the semantic chunks are larger on average, the distribution of the number of chunks per article is much wider for the recursive splitters. This is because the uniform size of recursive chunks means the chunk count scales directly with article length, whereas semantic chunking produces many small chunks and only a few large ones.
Now that we have clean chunks, the next step involves generating embeddings for our article chunks. These embeddings will serve as a crucial component for efficient retrieval within the RAG pipeline. For our vector store we'll utilize ChromaDB, a powerful tool for indexing and searching high-dimensional data. To integrate our chosen embedding models with ChromaDB, we'll define a custom wrapper class. This wrapper class will act as an intermediary, ensuring seamless communication between the models and the ChromaDB indexing system.
class CustomChromadbEmbeddingFunction(EmbeddingFunction):
def __init__(self, model) -> None:
super().__init__()
self.model = model
def _embed(self, texts):
return [self.model.embed_query(text) for text in texts]
def embed_query(self, query):
return self._embed([query])
def __call__(self, input: Documents) -> Embeddings:
embeddings = self._embed(input)
return embeddings
chroma_embedding_functions = {
"mini": CustomChromadbEmbeddingFunction(embedding_models["mini"]),
"bge-m3": CustomChromadbEmbeddingFunction(embedding_models["bge-m3"]),
"gte": CustomChromadbEmbeddingFunction(embedding_models["gte"]),
}
for name, embedding_function in chroma_embedding_functions.items():
sample = embedding_function(["Hello, world!"])[0][:5]
print(f"{name} embedding sample: {sample}")
mini embedding sample: [0.034922659397125244, 0.01883005164563656, -0.017854738980531693, 0.00013884028885513544, 0.0740736573934555]
bge-m3 embedding sample: [-0.016155630350112915, 0.02699342556297779, -0.04258322715759277, 0.013542207889258862, -0.019354630261659622]
gte embedding sample: [0.03789481893181801, 0.3469243049621582, -0.2047133892774582, -0.21238623559474945, -0.49100759625434875]
Generating embeddings can be a computationally intensive process. To optimize efficiency and avoid redundant computations, we'll leverage checkpointing. This technique involves storing the generated embeddings along with their corresponding article chunks. We'll define a simple class to encapsulate this data, facilitating efficient retrieval and reducing the need for recalculating embeddings unless absolutely necessary.
embeddings_folder = silver_folder / "embeddings"
if not embeddings_folder.exists():
embeddings_folder.mkdir()
class DocumentEmbedding():
def __init__(self, document: Document, text_embedding: List[float]) -> None:
self.document = document
self.text_embedding = text_embedding
def to_dict(self) -> Dict:
return {
"document": self.document.dict(),
"text_embedding": self.text_embedding
}
@classmethod
def from_dict(cls, d: Dict) -> "DocumentEmbedding":
return cls(
document=Document(**d["document"]),
text_embedding=d["text_embedding"]
)
def get_or_create_embeddings(
embedding_function: EmbeddingFunction,
chunks: List[Document],
embedding_name: str,
) -> List[DocumentEmbedding]:
embeddings_file = embeddings_folder / f"{embedding_name}_embeddings.json"
if embeddings_file.exists():
with open(embeddings_file, "r") as file:
embeddings = [DocumentEmbedding.from_dict(embedding) for embedding in json.load(file)]
print(f"Loaded {len(embeddings)} embeddings from {embeddings_file}")
else:
embeddings = []
for chunk in tqdm(chunks):
text_embedding = embedding_function([chunk.page_content])[0]
embedding = DocumentEmbedding(
document=chunk,
text_embedding=text_embedding
)
embeddings.append(embedding)
with open(embeddings_file, "w") as file:
json.dump([embedding.to_dict() for embedding in embeddings], file, indent=4)
print(f"Saved {len(embeddings)} embeddings to {embeddings_file}")
return embeddings
embeddings = {}
for embedding_name, embedding_function in chroma_embedding_functions.items():
for splitter_name, splitter_chunks in chunks.items():
embeddings[f"{embedding_name}_{splitter_name}"] = get_or_create_embeddings(
embedding_function, splitter_chunks, f"{embedding_name}_{splitter_name}"
)
Loaded 25399 embeddings from data/silver/embeddings/mini_recursive_256_embeddings.json
Loaded 5754 embeddings from data/silver/embeddings/mini_recursive_1024_embeddings.json
Loaded 3146 embeddings from data/silver/embeddings/mini_semantic_embeddings.json
Loaded 25399 embeddings from data/silver/embeddings/bge-m3_recursive_256_embeddings.json
Loaded 5754 embeddings from data/silver/embeddings/bge-m3_recursive_1024_embeddings.json
Loaded 3146 embeddings from data/silver/embeddings/bge-m3_semantic_embeddings.json
Loaded 25399 embeddings from data/silver/embeddings/gte_recursive_256_embeddings.json
Loaded 5754 embeddings from data/silver/embeddings/gte_recursive_1024_embeddings.json
Loaded 3146 embeddings from data/silver/embeddings/gte_semantic_embeddings.json
As mentioned above, for our semantic search retrieval we will store the embeddings in ChromaDB. ChromaDB indexes high-dimensional vectors using the Hierarchical Navigable Small World (HNSW) algorithm, which is known for its efficiency when searching high-dimensional spaces.
Much like a conventional SQL database, ChromaDB persists its data on disk (in an embedded SQLite store) and is accessed through a client. We will use the client to create a separate "collection" for each set of embeddings; a collection can be thought of as an index or vector space, and it is what we will later query to find the chunks most relevant to a user's question.

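Conceptually, querying a collection boils down to embedding the query and returning the chunks whose vectors have the smallest cosine distance. Here is a brute-force numpy sketch with toy 3-dimensional vectors (`top_k_cosine` is an illustrative helper, not part of ChromaDB; real embeddings have hundreds of dimensions, and ChromaDB uses HNSW to avoid comparing against every stored vector):

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    # normalise so the dot product equals the cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    distances = 1.0 - d @ q  # cosine distance = 1 - cosine similarity
    order = np.argsort(distances)  # smallest distance = most similar
    return order[:k], distances[order[:k]]

# toy "embeddings": documents 0 and 1 point in almost the same direction as the query
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
idx, dist = top_k_cosine(np.array([1.0, 0.05, 0.0]), docs, k=2)
print(idx)  # [0 1]: the two vectors closest in direction to the query
```

The collections we create below do the same job, only with an approximate index instead of this exhaustive scan.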
gold_folder = data_folder / "gold"
if not gold_folder.exists():
gold_folder.mkdir()
chromadb_folder = gold_folder / "chromadb"
if not chromadb_folder.exists():
chromadb_folder.mkdir()
chroma_client = chromadb.PersistentClient(path=chromadb_folder.as_posix())
def get_or_create_collection(
name: str,
embedding_function: EmbeddingFunction,
embeddings: List[DocumentEmbedding],
batch_size: int = 128
) -> Collection:
collection = chroma_client.get_or_create_collection(
name=name,
# configure to use cosine distance not default L2
metadata={"hnsw:space": "cosine"},
embedding_function=embedding_function
)
if collection.count() == 0:
for i in tqdm(range(0, len(embeddings), batch_size)):
batch = embeddings[i:i+batch_size]
collection.add(
documents=[embedding.document.page_content for embedding in batch],
embeddings=[embedding.text_embedding for embedding in batch],
ids=[str(embedding.document.metadata["id"]) for embedding in batch],
metadatas=[embedding.document.metadata for embedding in batch]
)
return collection
collections = {}
for collection_name, current_embeddings in embeddings.items():
collection = get_or_create_collection(
collection_name,
chroma_embedding_functions[collection_name.split("_")[0]],
current_embeddings
)
collections[collection_name] = collection
print(f"Collection {collection_name} has {collection.count()} documents")
Collection mini_recursive_256 has 25399 documents
Collection mini_recursive_1024 has 5754 documents
Collection mini_semantic has 3146 documents
Collection bge-m3_recursive_256 has 25399 documents
Collection bge-m3_recursive_1024 has 5754 documents
Collection bge-m3_semantic has 3146 documents
Collection gte_recursive_256 has 25399 documents
Collection gte_recursive_1024 has 5754 documents
Collection gte_semantic has 3146 documents
Once all the embeddings are stored in ChromaDB, we can test the retrieval process by querying one of our collections. Try a few different queries and check whether the most similar chunks actually make sense.
selected_collection = collections["gte_recursive_1024"]
results = selected_collection.query(
query_texts=["Climate Change"],
n_results=3,
)
for doc in results["documents"][0]:
print(wrap_text(doc))
print()
Report of the Intergovernmental Panel on Climate Change ( IPCC) makes for grim reading. It warns that the world is heading for calamitous temperature rises and points to the need for economies to decarbonise. The UK has set firm and ambitious targets and a pathway to net zero and CCUS will be one of the tools which is used to achieve this. scenario used in the study is unlikely because of global efforts to limit greenhouse gas emissions, the findings reveal a previously unknown tipping point that if activated would release an important brake on global warming, the authors said. `` We need to think about these worst-case scenarios to understand how our CO2 emissions might affect the oceans not just this century, but next century and the following century, '' said Megumi Chikamoto, who led the research as a research fellow at the University of Texas Institute for Geophysics. The study was published in the journal Geophysical Research Letters. Today, the oceans soak up about a third of the CO2 emissions generated by humans. Climate simulations had previously shown that the oceans slow their absorption of CO2 over time, but none had considered alkalinity as explanation. To reach their conclusion, the researchers recalculated pieces of a 450-year simulation until they hit on alkalinity as a key cause of the slowing. According to the findings, the Potential Climatic Impact of Nord Stream Methane Leaks: Nord Stream 1 and 2, two subsea pipelines that transport natural gas from Russia to Germany, were both intentionally destroyed on September 26th, 2022. Enormous amounts of gases, mainly methane, were discharged into the ocean and eventually into the atmosphere. Methane escaping from sabotaged pipelines in the Baltic Sea ( September 27th, 2022). Image Credit: Danish Armed Forces Methane is the second most prevalent anthropogenic greenhouse gas after CO2, although its greenhouse effect is substantially stronger. 
As a result, whether this catastrophe may have detrimental climatic consequences is a major issue around the world. This problem was discussed in a news article published in Nature, but no quantitative implications were reached. Recently, scientists from the Chinese Academy of Sciences' Institute of Atmospheric Physics approximated the potential climatic effect of leaked methane using the energy-conservation framework of the Intergovernmental
To gain a better understanding of how the retrieval process works, we will analyze the embedding space. We will start by projecting the embeddings into 2D using UMAP, a dimensionality reduction technique particularly well suited for visualizing high-dimensional data in a lower-dimensional space. We will then use the UMAP projections to create a scatter plot of the chunks.
def get_vectors_from_collection(collection: Collection):
stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
return np.array(stored_chunks["embeddings"])
def get_vectors_by_domain(collection: Collection, domain: str):
stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
metadatas = stored_chunks["metadatas"]
indices = [str(metadata["id"]) for metadata in metadatas if metadata["domain"] == domain]
return collection.get(include=["embeddings"], ids=indices)["embeddings"]
def fit_umap(vectors: np.ndarray):
return umap.UMAP().fit(vectors)
def project_embeddings(embeddings, umap_transform):
return umap_transform.transform(embeddings)
vectors = get_vectors_from_collection(selected_collection)
print(f"Original shape: {vectors.shape}")
umap_transform = fit_umap(vectors)
vectors_projections = project_embeddings(vectors, umap_transform)
print(f"Projected shape: {vectors_projections.shape}")
Original shape: (5754, 768)
Projected shape: (5754, 2)
You can zoom in the plot by clicking and dragging a box around the area you want to zoom in on. You can also reset the plot by double clicking on the plot.
fig = px.scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1])
fig.show()
Next we will color the embeddings by the domain of the article to see if there are any patterns or clusters in the embedding space based on the domain.
fig = go.Figure()
for domain in sample_df["domain"].unique():
domain_vectors = get_vectors_by_domain(selected_collection, domain)
domain_projections = project_embeddings(domain_vectors, umap_transform)
fig.add_trace(go.Scatter(x=domain_projections[:, 0], y=domain_projections[:, 1], mode='markers', marker=dict(size=4), name=domain))
fig.show()
We can also visualize the retrieval process by plotting the query and its most similar chunks in the embedding space. This gives us a better picture of how the nearest chunks are found. Keep in mind that the embeddings live in a high-dimensional space and we are only looking at a 2D projection, so the plotted distances between points may not be accurate. Try some different queries and see which chunks are retrieved.
def plot_retrieval_results(
    query: str,
    selected_collection: Collection,
    n_results: int = 5
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)
    nearest_neighbors = selected_collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    neighbor_vectors = selected_collection.get(include=["embeddings"], ids=nearest_neighbors["ids"][0])["embeddings"]
    neighbor_projections = project_embeddings(neighbor_vectors, umap_transform)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=neighbor_projections[:, 0], y=neighbor_projections[:, 1], mode='markers', marker=dict(size=5, color='orange'), name="nearest neighbors"))
    fig.add_trace(go.Scatter(x=query_projection[:, 0], y=query_projection[:, 1], mode='markers', marker=dict(size=10, color='red', symbol='x'), name="query"))
    fig.show()
plot_retrieval_results(
    "Climate Change",
    selected_collection,
)
Lastly, we will analyze the distribution of the cosine distances between the query and the different chunks. This gives us a better feel for the cosine distance and shows that distances in the high-dimensional space are not the same as in the 2D projection. Do not confuse cosine distance with cosine similarity: cosine similarity is the cosine of the angle between two vectors, while cosine distance is 1 minus the cosine similarity, so smaller values mean the vectors are more similar.
def cosine_distance(vector1, vector2):
    dot_product = np.dot(vector1, vector2.T)
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    similarity = dot_product / norm_product
    return 1 - similarity
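As a quick sanity check (not part of the original pipeline, just an illustration of the helper above): identical vectors have a cosine distance of 0, orthogonal vectors a distance of 1, and opposite vectors a distance of 2.

```python
import numpy as np

def cosine_distance(vector1, vector2):
    # same helper as defined above
    dot_product = np.dot(vector1, vector2.T)
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    similarity = dot_product / norm_product
    return 1 - similarity

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(cosine_distance(a, a))   # same direction -> 0.0
print(cosine_distance(a, b))   # orthogonal -> 1.0
print(cosine_distance(a, -a))  # opposite direction -> 2.0
```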
def plot_cosine_distances(
    query: str,
    selected_collection: Collection
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)
    similarities = np.array([cosine_distance(query_embedding, vector) for vector in vectors])
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=vectors_projections[:, 0],
        y=vectors_projections[:, 1],
        mode='markers',
        marker=dict(
            size=5,
            color=similarities.flatten(),
            colorscale='RdBu',
            colorbar=dict(title='Cosine Distance')
        ),
        text=['Cosine Distance: {:.4f}'.format(sim) for sim in similarities.flatten()],
        name='Other Vectors'
    ))
    fig.add_trace(go.Scatter(
        x=[query_projection[0][0]],
        y=[query_projection[0][1]],
        mode='markers',
        marker=dict(size=10, color='black', symbol='x'),
        text=['Query Vector'],
        name='Query Vector'
    ))
    fig.show()
plot_cosine_distances(
    "Climate Change",
    selected_collection,
)
Now that we have generated the embeddings and stored them in ChromaDB, we can put it all together and create the RAG pipeline. The RAG pipeline consists of the following steps:
1. Embed the user's question with the same embedding model used for the chunks.
2. Retrieve the most similar chunks from the vector store.
3. Insert the retrieved chunks as context into a prompt together with the question.
4. Let the LLM generate an answer based on that prompt.
In this notebook we will be using Langchain to build up our pipeline. You do not need a library like Langchain or LlamaIndex to build a RAG pipeline, but it can make the process easier.
The idea behind Langchain and its LCEL (LangChain Expression Language) is simple: a pipeline consists of many steps that each take an input and produce an output, and these steps can be chained together. LCEL is a small expression language for defining these steps and how they are connected. For more technical details on how Langchain works, check out the Langchain Documentation.
In simple terms, Langchain provides an abstraction of a step with an invoke method that takes an input (typically a dictionary of parameters) and returns an output (also typically a dictionary). This lets you chain different steps together, define how they are connected, and split off chains of steps into separate pipelines.
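To make the idea concrete, here is a minimal toy sketch (plain Python, not Langchain itself; all names here are made up for illustration) of the "step with an invoke method" abstraction and how the | operator can chain steps:

```python
class Step:
    """Toy version of a Langchain-style runnable: wraps a function with invoke()."""
    def __init__(self, func):
        self.func = func

    def invoke(self, inputs: dict) -> dict:
        return self.func(inputs)

    def __or__(self, other: "Step") -> "Step":
        # Chaining: the output dict of this step becomes the input of the next.
        return Step(lambda inputs: other.invoke(self.invoke(inputs)))

# Two tiny steps: build a prompt, then "generate" an answer (a stub, not a real LLM).
build_prompt = Step(lambda x: {"prompt": f"Question: {x['question']}"})
fake_llm = Step(lambda x: {"answer": x["prompt"].upper()})

chain = build_prompt | fake_llm
print(chain.invoke({"question": "What is RAG?"}))
# {'answer': 'QUESTION: WHAT IS RAG?'}
```

Langchain's real runnables add batching, streaming, and async variants on top of this pattern, but the core mental model is the same.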
Below you can see an overview of our RAG pipeline:

def create_qa_chain(retriever: BaseRetriever):
    template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. Keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
    rag_prompt = ChatPromptTemplate.from_template(template)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = RunnableParallel(
        {
            "context": retriever,
            "question": RunnablePassthrough()
        }
    ).assign(answer=(
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | rag_prompt
        | llm
        | StrOutputParser()
    ))
    return rag_chain
For Langchain to work with our ChromaDB collections, we need to wrap them in a format Langchain can work with: so-called stores and retrievers.
def collection_to_store(collection_name: str, lc_embedding_model: EmbeddingFunction):
    return Chroma(
        client=chroma_client,
        collection_name=collection_name,
        embedding_function=lc_embedding_model,
    )

def store_to_retriever(store: VectorStore, k: int = 3):
    retriever = store.as_retriever(
        search_type="similarity", search_kwargs={'k': k}
    )
    return retriever
selected_store = collection_to_store("gte_recursive_1024", embedding_models["gte"])
selected_retriever = store_to_retriever(selected_store)
selected_retriever.invoke("Climate Change")
[Document(page_content='Report of the Intergovernmental Panel on Climate Change ( IPCC) makes for grim reading. It warns that the world is heading for calamitous temperature rises and points to the need for economies to decarbonise. The UK has set firm and ambitious targets and a pathway to net zero and CCUS will be one of the tools which is used to achieve this.', metadata={'domain': 'energyvoice', 'id': 3173, 'title': 'The 10 Point Pod delves deep into the heart of CCUS', 'url': 'energyvoice.com/promoted/347021/the-10-point-pod-delves-deep-into-the-heart-of-ccus'}),
Document(page_content="scenario used in the study is unlikely because of global efforts to limit greenhouse gas emissions, the findings reveal a previously unknown tipping point that if activated would release an important brake on global warming, the authors said. `` We need to think about these worst-case scenarios to understand how our CO2 emissions might affect the oceans not just this century, but next century and the following century, '' said Megumi Chikamoto, who led the research as a research fellow at the University of Texas Institute for Geophysics. The study was published in the journal Geophysical Research Letters. Today, the oceans soak up about a third of the CO2 emissions generated by humans. Climate simulations had previously shown that the oceans slow their absorption of CO2 over time, but none had considered alkalinity as explanation. To reach their conclusion, the researchers recalculated pieces of a 450-year simulation until they hit on alkalinity as a key cause of the slowing. According to the findings, the", metadata={'domain': 'azocleantech', 'id': 633, 'title': 'Global Warming Could Trigger Chemical Changes in the Ocean Surface that Accelerate Climate Change', 'url': 'azocleantech.com/news.aspx?newsID=33053'}),
Document(page_content='Potential Climatic Impact of Nord Stream Methane Leaks: Nord Stream 1 and 2, two subsea pipelines that transport natural gas from Russia to Germany, were both intentionally destroyed on September 26th, 2022. Enormous amounts of gases, mainly methane, were discharged into the ocean and eventually into the atmosphere. Methane escaping from sabotaged pipelines in the Baltic Sea ( September 27th, 2022). Image Credit: Danish Armed Forces Methane is the second most prevalent anthropogenic greenhouse gas after CO2, although its greenhouse effect is substantially stronger. As a result, whether this catastrophe may have detrimental climatic consequences is a major issue around the world. This problem was discussed in a news article published in Nature, but no quantitative implications were reached. Recently, scientists from the Chinese Academy of Sciences’ Institute of Atmospheric Physics approximated the potential climatic effect of leaked methane using the energy-conservation framework of the Intergovernmental', metadata={'domain': 'azocleantech', 'id': 614, 'title': 'Potential Climatic Impact of Nord Stream Methane Leaks', 'url': 'azocleantech.com/news.aspx?newsID=32568'})]
Now that we have our retriever we can create our RAG pipeline. Try some different queries and see how the pipeline responds.
selected_chain = create_qa_chain(selected_retriever)
selected_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='Blue River, Vida, Phoenix, and Talent—were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 57, 'title': 'What Does A “ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing whatβ s known as the β vapor pressure deficit, β or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isnβ t the only factor behind the westβ s worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 58, 'title': 'What Does A β Normal β Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year’ s wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 59, 'title': 'What Does A “ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
chains = {}
for collection_name, collection in collections.items():
    store = collection_to_store(collection_name, embedding_models[collection_name.split("_")[0]])
    retriever = store_to_retriever(store)
    chain = create_qa_chain(retriever)
    chains[collection_name] = chain
chains.keys()
dict_keys(['mini_recursive_256', 'mini_recursive_1024', 'mini_semantic', 'bge-m3_recursive_256', 'bge-m3_recursive_1024', 'bge-m3_semantic', 'gte_recursive_256', 'gte_recursive_1024', 'gte_semantic'])
Because we have many hyperparameters to tune (chunk size, prompts, etc.) and different strategies to try, we will use the RAGAS (RAG Assessment) framework to evaluate our pipeline. RAGAS lets you evaluate a RAG pipeline using an LLM as a judge, alongside metrics that also utilize embedding models. We will go into more detail on the metrics later on.
Before we can start the evaluation we need to define the evaluation questions and their ground-truth answers. For this we will use the provided evaluation questions. To increase our question pool we will also generate additional question-answer pairs by sampling random chunks and using the LLM to generate both the question and the answer.
human_eval_df.head()
| example_id | question | relevant_section | url |
|---|---|---|---|
| 1 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | sgvoice.energyvoice.com/strategy/technology/23... |
| 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... |
| 3 | What is the EU’s Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | pv-magazine.com/2023/02/02/european-commission... |
| 4 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... |
| 5 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | cleantechnica.com/2023/05/08/general-motors-se... |
def generate_eval_answers(df: pd.DataFrame) -> pd.DataFrame:
    answer_generation_prompt = """Answer the following question based on the article:
Question: {question}
Article: {article}
"""
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm
    for i, row in tqdm(df.iterrows(), total=len(df)):
        df.at[i, "ground_truth"] = answer_generation_chain.invoke({"question": row["question"], "article": row["relevant_section"]}).content
    return df
if (silver_folder / "human_eval.csv").exists():
    human_eval_df = pd.read_csv(silver_folder / "human_eval.csv")
else:
    human_eval_df = generate_eval_answers(human_eval_df)
    human_eval_df.to_csv(silver_folder / "human_eval.csv", index=False)
human_eval_df.head()
| | question | relevant_section | url | ground_truth |
|---|---|---|---|---|
| 0 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | sgvoice.energyvoice.com/strategy/technology/23... | The innovation behind Leclanché's new method t... |
| 1 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... | The EU’s Green Deal Industrial Plan is an init... |
| 2 | What is the EU’s Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | pv-magazine.com/2023/02/02/european-commission... | The EU’s Green Deal Industrial Plan is aimed a... |
| 3 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | sgvoice.energyvoice.com/policy/25396/eu-seeks-... | The four focus areas of the EU's Green Deal In... |
| 4 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | cleantechnica.com/2023/05/08/general-motors-se... | The cooperation between GM and Honda on fuel c... |
def generate_synthetic_qa_pairs(documents: List[Document], n: int = 10) -> pd.DataFrame:
    synthetic_questions = []
    documents = np.random.choice(documents, n)
    question_generation_prompt = """Generate a short and general question based on the following news article:
Article: {article}
"""
    question_generation_chain = ChatPromptTemplate.from_template(question_generation_prompt) | llm
    answer_generation_prompt = """Answer the following question based on the article:
Question: {question}
Article: {article}
"""
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm
    for document in tqdm(documents):
        element = {}
        content = document.page_content
        element["relevant_section"] = content
        element["url"] = document.metadata["url"]
        question = question_generation_chain.invoke({"article": content}).content
        element["question"] = question
        answer = answer_generation_chain.invoke({"question": question, "article": content}).content
        element["ground_truth"] = answer
        synthetic_questions.append(element)
    return pd.DataFrame(synthetic_questions)
if not (silver_folder / "synthetic_eval.csv").exists():
    synthetic_eval_df = generate_synthetic_qa_pairs(chunks["recursive_1024"], 25)
    synthetic_eval_df.to_csv(silver_folder / "synthetic_eval.csv", index=False)
else:
    synthetic_eval_df = pd.read_csv(silver_folder / "synthetic_eval.csv", index_col=0)
synthetic_eval_df.head()
| relevant_section | url | question | ground_truth |
|---|---|---|---|
| Climate Shifts Forcefully Against Big Oil: The relationship between Big Oil and society is fundamentally changing. Public companies on both sides of the Atlantic are coming under a level of pressure to decarbonize their operations that was unthinkable just a year or two ago ( PIW Aug.7'20). This pressure is being wielded by investors as well as by court systems in some jurisdictions. The impact to corporate strategies could be enormous if companies feel they must respond to the heat by unwinding oil and gas operations earlier than planned. In one pivotal day this week, Exxon Mobil saw the tiny Engine No. 1 hedge fund unseat two -- and possibly three -- of its directors by harnessing the voting power of major pension and index funds. Chevron became the latest US company asked to set Scope 3 emissions reduction targets, following similar votes at ConocoPhillips and Phillips 66. And Royal Dutch Shell lost a Dutch court case that could force it to slash emissions by 45% by 2030, and redefine the obligations of | energyintel.com/0000017b-a7dd-de4c-a17b-e7df5a... | How is the relationship between Big Oil compan... | Big Oil companies are facing increasing pressu... |
| Government consults on changes to supply chain plans and CfD delivery: The UK government is consulting on changes to supply chain plan and Contracts for Difference ( CfD) policy in preparation for its fifth auction round. Launched February 4, the consultation by the Department for Business, Energy and Industrial Strategy ( BEIS) aims to garner industry input to make the CfD process “ more adaptable and forward looking. ” BEIS is inviting view views on the questions and pass threshold for the supply chain plan ( SCP) questionnaire, including the mooted introduction of interviews as part of the process; extending supply chain policy to support emerging technologies, starting with floating offshore wind projects; strengthening its disincentives for non-delivery; and amending Regulation 51 ( 10) ( c) of the CfD regulations which govern proposed project commissioning dates. Currently, developers aiming to build projects of 300MW or more must apply for an SCP statement from the Secretary of State for BEIS to take | energyvoice.com/renewables-energy-transition/3... | What changes are being proposed by the UK gove... | The UK government is proposing changes to supp... |
| volumes at the plant were now averaging 140 metric tons per day versus a 150 metric ton/d target. A Tepco spokesperson told Energy Intelligence Feb. 12 that relatively lower amounts of heavy rainfall in typhoons last fall helped ease pressure, but confirmed that `` it is still necessary to systematically construct necessary facilities, ” such as more tanks, and “ effectively use the entire site. '' `` They have been working very hard to successfully reduce the inflow of groundwater and rainwater that enter the building basements, becoming contaminated, requiring processing, and eventual additional tank storage, '' Lake Barrett, a senior adviser to Tepco, told Energy Intelligence. He added that this is a relatively dry season so short-term numbers may be misleading but that the `` trend is solidly downward and they are continuing actions to even further reduce in-leakage. '' This is giving `` the government more flexibility to find the least bad time to decide something, '' but a decision on water | energyintel.com/0000017b-a7dc-de4c-a17b-e7de9e... | What measures are being taken by Tepco to redu... | Tepco is taking measures such as constructing ... |
| rules. “ I think that BOEM is really looking to the industry to help with the development of these regulations, ” De Cort told an audience at OTC. “ We’ ve had some unofficial calls for where people want to look for these things. ” Exxon Mobil made waves last year in oil and gas Lease Sale 257 when it bid on more than 90 shallow-water blocks that sources say could be the “ sweet spot ” in the Gulf for injecting CO2 because of their geological characteristics. Unfortunately for Exxon, which is planning a $ 100 billion CCS project in the Houston Ship Channel, that auction was annulled by a federal court, meaning the company will likely not get to enjoy any first-mover advantage it had in attempting to secure that acreage. | energyintel.com/00000180-9b95-d98b-adb6-ff9590... | What role is the oil and gas industry playing ... | The oil and gas industry, such as Exxon Mobil,... |
| Heat Pumps – Page 2 – pv magazine International: The Gothenburg district court in Sweden has charged eight people for allegedly stealing nearly $ 416,000 of air source heat pumps, geothermal heat pumps, white goods and tools from multiple locations in the western part of the country. Samsung and SMA are using a new cloud-to-cloud system that allows PV systems with SMA inverters to be integrated with Samsung heat pumps. Toshiba Carrier has been recognized at the 2023 National Invention Awards of Japan for its innovative discharge port structure in multi-cylinder rotary piston compressors for heat pumps. The technology tackles the problem of overheating, resulting in improved heating capacity and efficiency. More than 20 companies, governments, and nongovernmental organizations have presented EU Energy Commissioner Kadri Simson with a roadmap for the European heat pump sector, including recommended solutions to overcome barriers to growth. Germany’ s MAN Energy Solutions has supplied two 50 MW seawater heat | pv-magazine.com/category/heat-pumps/page/2 | What innovative solutions are being developed ... | One innovative solution being developed in the... |
question_length = {
    "human": human_eval_df["question"].map(len),
    "synthetic": synthetic_eval_df["question"].map(len)
}
sns.histplot(question_length, kde=True)
plt.title("Question Length Distribution")
plt.xlabel("Question Length")
plt.ylabel("Count")
plt.show()
eval_df = pd.concat([human_eval_df, synthetic_eval_df], ignore_index=True)
eval_df["is_synthetic"] = eval_df["relevant_section"].isna()
eval_df["is_synthetic"].value_counts()
is_synthetic
True     25
False    23
Name: count, dtype: int64
Now we have doubled the number of questions and answers. However, we can see that our synthetic questions are slightly longer than the provided questions which could mean that they are slightly easier to answer. This potential bias should be taken into account when evaluating the pipeline.
RAGAS provides a variety of metrics to evaluate the performance of a RAG pipeline. The key metrics we will be using are:
- Faithfulness: how well the generated answer is grounded in the retrieved context.
- Answer relevancy: how relevant the generated answer is to the question.
- Context relevancy: how relevant the retrieved context is to the question.
- Answer correctness: how well the generated answer matches the ground-truth answer.
For these metrics to be computed, we create a test dataset for each of our RAG pipelines that contains the evaluation questions and their ground-truth answers. We then run all the questions through the RAG pipeline and store the generated answers and the retrieved chunks. From this test dataset we can calculate the RAGAS metrics.
datasets_folder = gold_folder / "datasets"
if not datasets_folder.exists():
    datasets_folder.mkdir()
def get_or_create_eval_dataset(name: str, df: pd.DataFrame, chain: Chain) -> Dataset:
    dataset_file = datasets_folder / f"{name}_dataset.json"
    if dataset_file.exists():
        with open(dataset_file, "r") as file:
            dataset = Dataset.from_dict(json.load(file))
        print(f"Loaded {name} dataset from {dataset_file}")
    else:
        datapoints = {
            "question": df["question"].tolist(),
            "answer": [],
            "contexts": [],
            "ground_truth": df["ground_truth"].tolist(),
            "context_urls": []
        }
        for question in tqdm(datapoints["question"]):
            result = chain.invoke(question)
            datapoints["answer"].append(result["answer"])
            datapoints["contexts"].append([str(doc.page_content) for doc in result["context"]])
            datapoints["context_urls"].append([doc.metadata["url"] for doc in result["context"]])
        dataset = Dataset.from_dict(datapoints)
        with open(dataset_file, "w") as file:
            json.dump(dataset.to_dict(), file)
        print(f"Saved {name} dataset to {dataset_file}")
    return dataset
results_folder = gold_folder / "results"
if not results_folder.exists():
    results_folder.mkdir()
def get_or_run_llm_eval(name: str, dataset: Dataset, llm_judge_model: LLM) -> pd.DataFrame:
    eval_results_file = results_folder / f"{name}_llm_eval_results.csv"
    if eval_results_file.exists():
        eval_results = pd.read_csv(eval_results_file)
        print(f"Loaded {name} evaluation results from {eval_results_file}")
    else:
        eval_results = evaluate(
            dataset,
            metrics=[faithfulness, answer_relevancy, context_relevancy, answer_correctness],
            is_async=True,
            llm=llm_judge_model,
            embeddings=embedding_models["gte"],
            run_config=RunConfig(timeout=60, max_retries=10, max_wait=60, max_workers=8),
        ).to_pandas()
        eval_results.to_csv(eval_results_file, index=False)
        print(f"Saved {name} evaluation results to {eval_results_file}")
    return eval_results
def plot_llm_eval(name: str, eval_results: pd.DataFrame):
    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = eval_results.select_dtypes(include=[np.float64])
    # boxplot of distributions
    sns.boxplot(data=ragas_metrics_data, palette="Set2")
    plt.title(f'{name}: Distribution of RAGAS Evaluation Metrics')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    # barplot of means
    means = ragas_metrics_data.mean()
    plt.figure(figsize=(14, 8))
    sns.barplot(x=means.index, y=means, palette="Set2")
    plt.title(f'{name}: Mean of RAGAS Evaluation Metrics')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
def plot_multiple_evals(eval_results: Dict[str, pd.DataFrame]):
    # combine the results
    full_results = []
    for name, results in eval_results.items():
        results['name'] = name
        full_results.append(results)
    full_results = pd.concat(full_results, ignore_index=True)
    full_results = full_results.sort_values(by='name')
    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = full_results.select_dtypes(include=[np.float64])
    ragas_metrics_data['name'] = full_results['name']
    # boxplot of distributions
    plt.figure(figsize=(14, 8))
    sns.boxplot(x='variable', y='value', hue='name', data=pd.melt(ragas_metrics_data, id_vars='name'), palette="Set2")
    plt.title('Distribution of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
    # barplot of means
    means = ragas_metrics_data.groupby('name').mean().reset_index()
    means_melted = pd.melt(means, id_vars='name')
    plt.figure(figsize=(14, 8))
    sns.barplot(x='variable', y='value', hue='name', data=means_melted, palette="Set2")
    plt.title('Mean of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
selected_dataset = get_or_create_eval_dataset("selected", eval_df, selected_chain)
Loaded selected dataset from data/gold/datasets/selected_dataset.json
selected_llm_eval_results = get_or_run_llm_eval("selected", selected_dataset, llm)
plot_llm_eval("selected", selected_llm_eval_results)
Loaded selected evaluation results from data/gold/results/selected_llm_eval_results.csv
datasets = {}
for name, chain in chains.items():
    datasets[name] = get_or_create_eval_dataset(name, eval_df, chain)
Loaded mini_recursive_256 dataset from data/gold/datasets/mini_recursive_256_dataset.json
Loaded mini_recursive_1024 dataset from data/gold/datasets/mini_recursive_1024_dataset.json
Loaded mini_semantic dataset from data/gold/datasets/mini_semantic_dataset.json
Loaded bge-m3_recursive_256 dataset from data/gold/datasets/bge-m3_recursive_256_dataset.json
Loaded bge-m3_recursive_1024 dataset from data/gold/datasets/bge-m3_recursive_1024_dataset.json
Loaded bge-m3_semantic dataset from data/gold/datasets/bge-m3_semantic_dataset.json
Loaded gte_recursive_256 dataset from data/gold/datasets/gte_recursive_256_dataset.json
Loaded gte_recursive_1024 dataset from data/gold/datasets/gte_recursive_1024_dataset.json
Loaded gte_semantic dataset from data/gold/datasets/gte_semantic_dataset.json
llm_results = {}
for dataset_name, dataset in datasets.items():
    llm_results[dataset_name] = get_or_run_llm_eval(dataset_name, dataset, llm)
Loaded mini_recursive_256 evaluation results from data/gold/results/mini_recursive_256_llm_eval_results.csv
Loaded mini_recursive_1024 evaluation results from data/gold/results/mini_recursive_1024_llm_eval_results.csv
Loaded mini_semantic evaluation results from data/gold/results/mini_semantic_llm_eval_results.csv
Loaded bge-m3_recursive_256 evaluation results from data/gold/results/bge-m3_recursive_256_llm_eval_results.csv
Loaded bge-m3_recursive_1024 evaluation results from data/gold/results/bge-m3_recursive_1024_llm_eval_results.csv
Loaded bge-m3_semantic evaluation results from data/gold/results/bge-m3_semantic_llm_eval_results.csv
Loaded gte_recursive_256 evaluation results from data/gold/results/gte_recursive_256_llm_eval_results.csv
Loaded gte_recursive_1024 evaluation results from data/gold/results/gte_recursive_1024_llm_eval_results.csv
Loaded gte_semantic evaluation results from data/gold/results/gte_semantic_llm_eval_results.csv
plot_multiple_evals(llm_results)
From the evaluation we can see that the RAG pipeline using the GTE embedding model by Alibaba with recursive chunking and a chunk size of 1024 performs best. This is likely because GTE is the most powerful of our embedding models and recursive chunking with a chunk size of 1024 provides the most context to the LLM.
best_collection = collections["gte_recursive_1024"]
best_store = collection_to_store("gte_recursive_1024", embedding_models["gte"])
In this final section we will look at some more advanced methods to improve our RAG pipeline and compare them to our best-performing pipeline.
Multi-querying is a technique in which the retrieval model is queried with multiple questions to retrieve relevant chunks. This can enhance retrieval by leveraging the diversity of queries to capture a broader range of relevant information. By combining the results from multiple queries, we can potentially improve the quality of the retrieved chunks and, consequently, the generated responses. When creating these additional queries, the goal is to produce queries that differ from the original but still address the user's information need, i.e., variations of the original query.

def generate_query_variations(query: str, num_additional_queries: int) -> List[str]:
    multiquery_prompt = """You are an assistant tasked with generating {num_queries} \
different versions of the given user question to retrieve relevant documents from a vector \
database. By generating multiple perspectives on the user question and breaking it down, your goal is to help \
the user overcome some of the limitations of the distance-based similarity search. \
Provide these alternative questions separated by newlines without any numbering or listing.
Original question: {question}
Alternatives:
"""
    multiquery_chain = ChatPromptTemplate.from_template(multiquery_prompt) | llm
    return multiquery_chain.invoke({"question": query, "num_queries": num_additional_queries}).content.split("\n")
def plot_multiquery_retrieval_results(query: str, collection: Collection, num_additional_queries: int = 3, num_results: int = 3):
    vectors = get_vectors_from_collection(collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)
    query_variations = generate_query_variations(query, num_additional_queries)
    query_variations_projections = project_embeddings(collection._embedding_function(query_variations), umap_transform)

    original_relevant_docs = collection.query(
        query_texts=[query],
        n_results=num_results,
    )
    original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist]  # flatten
    original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
    original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)

    additional_relevant_docs = collection.query(
        query_texts=query_variations,
        n_results=num_results,
    )
    additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist]  # flatten
    # remove duplicates
    additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
    # remove the original relevant docs from the additional relevant docs
    additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
    additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
    additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
    fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="query variations"))
    fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
    fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
    fig.show()
plot_multiquery_retrieval_results("Climate Change", selected_collection)
class MultiQueryRetriever(BaseRetriever):
    store: VectorStore
    num_additional_queries: int = 3
    num_results: int = 3

    def _get_query_variations(self, query: str) -> List[str]:
        return generate_query_variations(query, self.num_additional_queries)

    def _get_relevant_documents(
        self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        queries = self._get_query_variations(original_query)
        queries.append(original_query)
        retriever = store_to_retriever(self.store, k=self.num_results)
        relevant_docs = []
        for query in queries:
            results = retriever.invoke(query, run_manager=run_manager)
            # remove duplicates
            for res in results:
                if res not in relevant_docs:
                    relevant_docs.append(res)
        return relevant_docs
multiquery_retriever = MultiQueryRetriever(store=best_store, num_additional_queries=3, num_results=3)
multiquery_chain = create_qa_chain(multiquery_retriever)
multiquery_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what's known as the "vapor pressure deficit," or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn't the only factor behind the west's worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 58, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='Blue River, Vida, Phoenix, and Talent–were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 57, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='parts of Washington, Oregon, Idaho, and Nevada. Some scientists are also raising concerns that all the young grasses and other plants that have sprung up as a result of the wet weather could quickly turn into dry kindling for wildfires as the dry season wears on into late summer and fall. According to the latest wildland fire outlook, most of the western United States is expected to experience either normal or below-normal fire activity between May and August this year. Source: National Interagency Fire Center. There are many different ways to measure wildfire activity, but by almost any metric, wildfires across the western US and southwestern Canada are worsening. Reliable, consistent wildfire metrics across the region started to become available in the mid-1980s. Here's what the trends show. From 1984 to 1999, the region experienced an average of roughly 230 fires per year. From 2000 to 2021, the average was more than 350 fires per year. The number of wildfires larger than 1,000 acres in western North', metadata={'domain': 'cleantechnica', 'id': 52, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year's wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 59, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
datasets["multiquery"] = get_or_create_eval_dataset("multiquery", eval_df, multiquery_chain)
Loaded multiquery dataset from data/gold/datasets/multiquery_dataset.json
llm_results["multiquery"] = get_or_run_llm_eval("multiquery", datasets["multiquery"], llm)
Loaded multiquery evaluation results from data/gold/results/multiquery_llm_eval_results.csv
strategy_results = {}
strategy_results["gte_recursive_1024"] = llm_results["gte_recursive_1024"]
strategy_results["multiquery"] = llm_results["multiquery"]
plot_multiple_evals(strategy_results)
We can see that on average the answer correctness does slightly increase when using multi-querying. This is likely because the retrieval process is more robust and captures a broader range of relevant information. However, faithfulness and context relevancy decrease, which could be because multi-querying introduces more noise into the retrieval process: it retrieves more chunks overall, and some of them are less relevant.
The idea of the HyDE (Hypothetical Document Embeddings) method is to generate hypothetical documents that resemble an answer to the user query and then retrieve the chunks most similar to these hypothetical documents. This can be useful when the user query is not very specific or lies far from the relevant chunks in the embedding space. Another way to think about it: by generating a hypothetical answer we reach an area of the embedding space that is closer to the actual answer, an area that might not be reachable from the user query alone.
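The geometric intuition can be illustrated with made-up embeddings. The 3-d vectors here are invented purely for illustration; they show how a longer hypothetical passage can sit closer (by cosine similarity) to the relevant chunk than the short user query does.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy "embeddings" for illustration only.
query_vec = np.array([1.0, 0.1, 0.0])  # short, underspecified question
hypo_vec = np.array([0.7, 0.6, 0.3])   # LLM-written hypothetical passage
chunk_vec = np.array([0.6, 0.7, 0.4])  # actual relevant news chunk

print(cosine(query_vec, chunk_vec))  # the query is comparatively far from the chunk
print(cosine(hypo_vec, chunk_vec))   # the hypothetical document is much closer
```

In a real pipeline the hypothetical passage is embedded with the same model as the chunks, so retrieval simply searches with the hypothetical document's vector instead of (or in addition to) the query's vector.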

def generate_hypothetical_document(query: str, num_hypotheses: int) -> List[str]:
    hyde_prompt = """Please write a news passage about the topic.
Topic: {query}
Passage:
"""
    hyde_chain = ChatPromptTemplate.from_template(hyde_prompt) | llm
    hypothetical_documents = [hyde_chain.invoke({"query": query}).content for _ in range(num_hypotheses)]
    return hypothetical_documents
def plot_hyde_retrieval_results(query: str, collection: Collection, num_hypo_documents: int = 2, num_results: int = 3):
    vectors = get_vectors_from_collection(collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)
    query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)
    hypothetical_documents = generate_hypothetical_document(query, num_hypo_documents)
    hypothetical_docs_projections = project_embeddings(collection._embedding_function(hypothetical_documents), umap_transform)

    original_relevant_docs = collection.query(
        query_texts=[query],
        n_results=num_results,
    )
    original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist]  # flatten
    original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
    original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)

    additional_relevant_docs = collection.query(
        query_texts=hypothetical_documents,
        n_results=num_results,
    )
    additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist]  # flatten
    # remove duplicates
    additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
    # remove the original relevant docs from the additional relevant docs
    additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
    additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
    additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
    fig.add_trace(go.Scatter(x=hypothetical_docs_projections[:, 0], y=hypothetical_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="hypothetical documents"))
    fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
    fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
    fig.show()
plot_hyde_retrieval_results("Climate Change", selected_collection)
class HyDERetriever(BaseRetriever):
    store: VectorStore
    num_hypo_documents: int = 2
    num_results: int = 3

    def _get_hypothetical_documents(self, query: str) -> List[str]:
        return generate_hypothetical_document(query, self.num_hypo_documents)

    def _get_relevant_documents(
        self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        hypothetical_documents = self._get_hypothetical_documents(original_query)
        hypothetical_documents.append(original_query)
        retriever = store_to_retriever(self.store, k=self.num_results)
        relevant_docs = []
        for query in hypothetical_documents:
            results = retriever.invoke(query, run_manager=run_manager)
            # remove duplicates
            for res in results:
                if res not in relevant_docs:
                    relevant_docs.append(res)
        return relevant_docs
hyde_retriever = HyDERetriever(store=best_store, num_hypo_documents=2, num_results=3)
hyde_chain = create_qa_chain(hyde_retriever)
hyde_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what's known as the "vapor pressure deficit," or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn't the only factor behind the west's worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 58, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='Blue River, Vida, Phoenix, and Talent–were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 57, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='Let's dive into western wildfires by the numbers. As spring turns to summer and the days warm up, the Northern Hemisphere enters the period known as Danger Season, when wildfires, heat waves, and hurricanes, all amplified by climate change, begin to ramp up. In the western United States, the start of Danger Season is marked by the shift from the wintertime wet season to the summertime dry season. While wildfires can and do occur all year round, this shift from cool and wet to warm and dry marks the start of wildfire season in the region. According to the latest seasonal outlook from the National Interagency Fire Center, the exceptionally rainy and snowy conditions the west experienced during the winter of 2022-2023 are translating to below-average to normal levels of wildfire risk across most western states at least through August. That said, above-normal activity is expected for parts of Washington, Oregon, Idaho, and Nevada. Some scientists are also raising concerns that all the young grasses and other', metadata={'domain': 'cleantechnica', 'id': 51, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year's wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 59, 'title': 'What Does A "Normal" Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
datasets["hyde"] = get_or_create_eval_dataset("hyde", eval_df, hyde_chain)
Loaded hyde dataset from data/gold/datasets/hyde_dataset.json
llm_results["hyde"] = get_or_run_llm_eval("hyde", datasets["hyde"], llm)
Loaded hyde evaluation results from data/gold/results/hyde_llm_eval_results.csv
strategy_results["hyde"] = llm_results["hyde"]
plot_multiple_evals(strategy_results)
Just like with multi-querying, we can see that answer correctness increases when using the HyDE method.
There are many other methods that can be used to improve the RAG pipeline. Some of these include:
os.system("jupyter nbconvert --to html --template pj cleantech_rag.ipynb")
0